EDA Master Control

Authors

Josh, Joel, Corinn, Gaby

Published

February 17, 2026

Introduction

Business Problem

MasterControl is a SaaS provider offering two core software suites: QX (Quality Solutions) and MX (Manufacturing Solutions). While QX has over 20 years of market maturity, MX is a newer offering launched approximately four years ago. MX currently trails QX, with a lead progression rate of 12.7% versus 19.7% for QX. Leadership believes the current targeting strategy results in sales pursuing leads that are less likely to convert, creating missed opportunities and inefficient resource allocation.

Analytic Problem

Our analytic task is to identify the industries, company characteristics, and job titles associated with higher MX progression rates. The target variable is Lead Outcome, where a lead is considered successful if it reached SQL, SQO, or Won.

The goal of EDA is to:
- Understand the structure and quality of the data
- Examine relationships between lead attributes and Lead Outcome
- Identify which features appear predictive and should be prioritized in modeling
- Detect missingness, inconsistency, and outlier behavior

Scope

The output of this project will be recommendations for:
- Which industries to target
- Which job titles represent high-value decision makers
- How Marketing and Sales can prioritize outreach for the MX product
- Improvements to prospect data-capture on the MasterControl website

Predicting churn or customer retention is out of scope.

Success Metrics

  • Primary metric: Lead progression rate (SQL/SQO/Won)
  • Goal: Increase MX progression from 12.7% → 16–18%, moving toward the QX benchmark of 19.7%

Success means enabling MasterControl Sales to focus on high-potential profiles that historically progress at above-average rates.
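As a rough check on what the target lift implies, a back-of-envelope sketch (the MX lead count of 4,125 is taken from the outcome-known MX totals reported later in this document; the figures are illustrative only) suggests roughly 135–220 additional progressed MX leads:

```r
# Back-of-envelope: additional successful MX leads implied by the target lift
# (lead count and rates from this report; illustrative only)
mx_leads     <- 4125
current_rate <- 0.127
target_range <- c(0.16, 0.18)
extra_wins <- round(mx_leads * target_range) - round(mx_leads * current_rate)
extra_wins  # additional progressed leads at the low and high end of the goal
```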

Project Management

Team members include Corinn, Josh, Joel, and Gaby, meeting weekly with the following milestones:
- Feb 1 — Business problem statement finalized
- Feb 22 — EDA complete
- Mar 22 — Modeling & evaluation
- Apr 8 — Final presentation


Questions to Guide EDA

  1. Data Quality
    • What variables exist? How many leads?
    • Are job titles and industry columns clean or messy?
    • Which variables have missing values?
    • Is the missing data random, or does it follow a pattern?
    • Are there any duplicate leads?
  2. Target Variable
    • What is the overall MX success rate?
    • Is the target imbalanced?
  3. Job Titles
    • Which job title patterns (e.g., Director, Manager, Engineer) appear most in converted leads?
    • Which technical/manufacturing roles convert best?
  4. Industry
    • Which industries show the highest conversion rates?
    • Are specific NAICS groups highly predictive?
    • Can differences between conversion rates of the products be explained by industry differences?
  5. Account Characteristics
    • Do company size, revenue, or region relate to successful MX outcomes?
  6. Interactions
    • Do specific title + industry combinations show high conversion?
  7. Modeling Implications
    • Which variables look most promising?
    • Which variables need transformations or cleaning?

Data Exploration

Setup

# install.packages("pacman")  # run once if pacman is not already installed
pacman::p_load(tidyverse,stringr,tidytext,quanteda,quanteda.textplots,janitor,skimr,recipes,
               cluster,factoextra,FactoMineR,proxy,dendextend,pheatmap,Rtsne,fpc)

Load Data

# Load Data
leads <- read_csv("QAL Performance for MSBA.csv")


# Basic structure
dim(leads)
[1] 16815    14
glimpse(leads)
Rows: 16,815
Columns: 14
$ `acct primary site function`   <chr> "Food and beverage", "Small-molecule AP…
$ `acct manufacturing model`     <chr> "Consumer Packaged Goods", "In-House", …
$ `acct target industry`         <chr> "Non-Life Science", "Pharma & BioTech",…
$ `contact/lead title`           <chr> "QA Manager", "VP Quality&Regulatory Af…
$ `QAL ID`                       <chr> "a0dPQ000005WbfuYAC", "a0dPQ000008ye7dY…
$ `contact/lead id`              <chr> "0030c00002XPeNGAA1", "0030c00002XR1uPA…
$ next_stage__c                  <chr> "SQL", "Recycled", "Recycled", "Recycle…
$ Priority                       <chr> "P1 - Webinar Demo", "No Priority", "Pr…
$ `acct territory rollup`        <chr> "Americas", "EMEA", "EMEA", "EMEA", "Am…
$ `acct tier rollup`             <chr> "Medium", "Medium", "Small", "Small", "…
$ solution                       <chr> "Mx", "Qx", "Qx", "Qx", "Mx", "Mx", "Qx…
$ solution_rollup                <chr> "Mx", "Qx", "Qx", "Qx", "Mx", "Mx", "Qx…
$ `last tactic campaign channel` <chr> "Online Ads", "Outbound Prospecting", "…
$ `qal cohort date`              <date> 2025-06-16, 2025-09-19, 2024-02-22, 20…
# Clean column names for easier coding
leads <- clean_names(leads)
names(leads)
 [1] "acct_primary_site_function"   "acct_manufacturing_model"    
 [3] "acct_target_industry"         "contact_lead_title"          
 [5] "qal_id"                       "contact_lead_id"             
 [7] "next_stage_c"                 "priority"                    
 [9] "acct_territory_rollup"        "acct_tier_rollup"            
[11] "solution"                     "solution_rollup"             
[13] "last_tactic_campaign_channel" "qal_cohort_date"             
# Quick preview
head(leads)
# A tibble: 6 × 14
  acct_primary_site_function acct_manufacturing_model acct_target_industry
  <chr>                      <chr>                    <chr>               
1 Food and beverage          Consumer Packaged Goods  Non-Life Science    
2 Small-molecule API         In-House                 Pharma & BioTech    
3 Sterile injectables        In-House                 Pharma & BioTech    
4 Sterile injectables        In-House                 Pharma & BioTech    
5 Biologic API               In-House                 Pharma & BioTech    
6 Surgical instruments       In-House                 Medical Device      
# ℹ 11 more variables: contact_lead_title <chr>, qal_id <chr>,
#   contact_lead_id <chr>, next_stage_c <chr>, priority <chr>,
#   acct_territory_rollup <chr>, acct_tier_rollup <chr>, solution <chr>,
#   solution_rollup <chr>, last_tactic_campaign_channel <chr>,
#   qal_cohort_date <date>
# Basic summary statistics
summary(leads)
 acct_primary_site_function acct_manufacturing_model acct_target_industry
 Length:16815               Length:16815             Length:16815        
 Class :character           Class :character         Class :character    
 Mode  :character           Mode  :character         Mode  :character    
                                                                         
                                                                         
                                                                         
 contact_lead_title    qal_id          contact_lead_id    next_stage_c      
 Length:16815       Length:16815       Length:16815       Length:16815      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
   priority         acct_territory_rollup acct_tier_rollup     solution        
 Length:16815       Length:16815          Length:16815       Length:16815      
 Class :character   Class :character      Class :character   Class :character  
 Mode  :character   Mode  :character      Mode  :character   Mode  :character  
                                                                               
                                                                               
                                                                               
 solution_rollup    last_tactic_campaign_channel qal_cohort_date     
 Length:16815       Length:16815                 Min.   :2024-01-01  
 Class :character   Class :character             1st Qu.:2024-05-22  
 Mode  :character   Mode  :character             Median :2024-12-03  
                                                 Mean   :2024-12-02  
                                                 3rd Qu.:2025-06-05  
                                                 Max.   :2026-01-05  
# Check missing values 
colSums(is.na(leads))
  acct_primary_site_function     acct_manufacturing_model 
                        6877                         6875 
        acct_target_industry           contact_lead_title 
                          45                         6408 
                      qal_id              contact_lead_id 
                           0                            3 
                next_stage_c                     priority 
                         483                            0 
       acct_territory_rollup             acct_tier_rollup 
                         133                           45 
                    solution              solution_rollup 
                           0                            0 
last_tactic_campaign_channel              qal_cohort_date 
                           0                            0 
#view data set
#view(leads)

# Check for duplicate rows by qal_id
sum(duplicated(leads$qal_id))
[1] 0
# Check which leads are duplicated
leads$qal_id[duplicated(leads$qal_id)]
character(0)

Dataset Overview

The leads dataset contains 16,815 rows and 14 columns. Each row represents a Qualified Account Lead (QAL) for MasterControl. After cleaning column names, the primary fields fall into the following categories:

Account / site context:

  • acct_primary_site_function: describes the primary activity at the customer site (e.g., Food and beverage, Small-molecule API, Sterile injectables).
  • acct_manufacturing_model: indicates how manufacturing is managed (e.g., Consumer Packaged Goods, In-House, Hospital/Investor model).
  • acct_target_industry: the industry segment of the account (e.g., Pharma & BioTech, Medical Device, Non-Life Science).

Contact-level information:

  • contact_lead_title: the job title of the lead or contact (e.g., QA Manager, VP Quality & Regulatory Affairs).
  • contact_lead_id: identifier for the individual contact; may repeat if a person is tied to multiple QAL records.

QAL identifiers and lead outcomes:

  • qal_id: unique identifier for the QAL record.
  • next_stage_c: the lead outcome or progression stage (e.g., SQL, SQO, Won, Recycled). This is our target variable and will be converted into a binary success indicator for modeling.

Account segmentation and territory:

  • acct_territory_rollup: geographic region (Americas, EMEA, APAC & Oceania).
  • acct_tier_rollup: account size/tier (Small, Medium, Large), useful for segmentation.

Product and marketing channel:

  • solution: specific MasterControl product family engaged (Mx, Qx, Ax, Logbooks).
  • solution_rollup: product family grouping used to distinguish MX from QX.
  • last_tactic_campaign_channel: most recent marketing touchpoint (e.g., Email, Online Ads, Outbound Prospecting).

Time

  • qal_cohort_date: date when the QAL entered the pipeline, ranging from early 2024 to early 2026.

All variables except the cohort date are categorical and will be treated as factors during modeling. contact_lead_title, despite being categorical, will not be treated as a factor given the number of unique values it contains. The dataset includes a mix of account attributes, contact information, product engagement, and progression outcomes, providing multiple angles to explore what drives higher MX success. There are no duplicated qal_id values in the dataset, so every lead appears exactly once and will be counted only once in the analysis. The next step is to evaluate missingness and data quality across these fields to understand how each variable will behave in downstream analysis.
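The factor handling described above can be sketched in one step (base R for clarity, assuming the leads tibble as loaded in the setup chunks; this is not the final cleaning code):

```r
# Convert every character column except the high-cardinality job title
# to a factor; contact_lead_title stays as free text
char_cols   <- names(leads)[sapply(leads, is.character)]
factor_cols <- setdiff(char_cols, "contact_lead_title")
leads[factor_cols] <- lapply(leads[factor_cols], factor)
```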

Missing Data

# Count missing values per column
missing_values <- leads %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "n_missing") %>%
  mutate(
    pct_missing = 100 * n_missing / nrow(leads)
  ) %>%
  arrange(desc(pct_missing))

missing_values
# A tibble: 14 × 3
   variable                     n_missing pct_missing
   <chr>                            <int>       <dbl>
 1 acct_primary_site_function        6877     40.9   
 2 acct_manufacturing_model          6875     40.9   
 3 contact_lead_title                6408     38.1   
 4 next_stage_c                       483      2.87  
 5 acct_territory_rollup              133      0.791 
 6 acct_target_industry                45      0.268 
 7 acct_tier_rollup                    45      0.268 
 8 contact_lead_id                      3      0.0178
 9 qal_id                               0      0     
10 priority                             0      0     
11 solution                             0      0     
12 solution_rollup                      0      0     
13 last_tactic_campaign_channel         0      0     
14 qal_cohort_date                      0      0     

Missing Data Overview

The dataset contains several fields with notable missingness. The table above shows the number and percentage of missing values for each variable. Three variables have substantial missingness:

  • acct_primary_site_function — 40.9% missing
  • acct_manufacturing_model — 40.9% missing
  • contact_lead_title — 38.1% missing

These three variables represent important account and contact attributes. Their high level of missingness suggests that many QAL records lack basic information about what the site does, how it manufactures, or who the contact person is. Because job titles are central to this project, the missingness in contact_lead_title will be especially important in modeling and may require grouping into an “Unknown” category.

Moderate missingness appears in:

  • next_stage_c — 2.87% missing
    These records have no known lead outcome and will be excluded when computing success rates.

  • acct_territory_rollup — 0.79% missing

Low missingness (less than 0.3%) appears in:

  • acct_target_industry (45 missing)
  • acct_tier_rollup (45 missing)
  • contact_lead_id (3 missing)

These nearly complete variables can be used reliably for segmentation and cohort-based analysis.

Overall, the missing data pattern indicates that contact and site-level details are inconsistently collected, while product, territory, and outcome fields are much more complete. For modeling, we will likely need to consolidate missing categories rather than impute them, since these fields represent high-cardinality categorical attributes.
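The consolidation strategy above can be sketched with a minimal base-R helper (an illustrative sketch, not the team's final cleaning code; the column name comes from the dataset above):

```r
# Recode NA in a categorical field to an explicit "Unknown" level
# instead of imputing a value
consolidate_na <- function(x, label = "Unknown") {
  x[is.na(x)] <- label
  x
}
leads$contact_lead_title <- consolidate_na(leads$contact_lead_title)
```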

Verify Missing Values

According to the data dictionary, the percentage of missing information across variables is minimal, with the exception of job titles. Specifically, the documentation reports:

  • contact/lead title: 29.21% missing
  • next_stage__c: 2.37% missing
  • acct_territory_rollup: 0.79% missing
  • acct_manufacturing_model: 0.61% missing
  • acct_primary_site_function: 0.36% missing
  • All other variables: less than 0.3% missing

Based on this documentation, account-level variables such as industry, manufacturing model, and site function are expected to be highly complete and reliable for segmentation analysis.

To validate these assumptions, we will verify whether the dataset used in this analysis reflects the same missingness levels reported in the data dictionary.

# Check blank strings (optional)
sum(leads$acct_primary_site_function == "", na.rm = TRUE)
[1] 0
sum(leads$acct_manufacturing_model == "", na.rm = TRUE)
[1] 0
dict_missing <- tibble::tribble(
  ~variable, ~dict_pct_missing,
  "contact_lead_title", 29.21,
  "next_stage_c", 2.37,
  "acct_territory_rollup", 0.79,
  "acct_manufacturing_model", 0.61,
  "acct_primary_site_function", 0.36,
  "acct_tier_rollup", 0.27,
  "acct_target_industry", 0.27,
  "contact_lead_id", 0.02
)

compare_missing <- missing_values %>%
  left_join(dict_missing, by = "variable") %>%
  mutate(diff_pct = round(pct_missing - dict_pct_missing, 2)) %>%
  arrange(desc(pct_missing))

compare_missing
# A tibble: 14 × 5
   variable                     n_missing pct_missing dict_pct_missing diff_pct
   <chr>                            <int>       <dbl>            <dbl>    <dbl>
 1 acct_primary_site_function        6877       40.9              0.36     40.5
 2 acct_manufacturing_model          6875       40.9              0.61     40.3
 3 contact_lead_title                6408       38.1             29.2       8.9
 4 next_stage_c                       483        2.87             2.37      0.5
 5 acct_territory_rollup              133        0.79             0.79      0  
 6 acct_target_industry                45        0.27             0.27      0  
 7 acct_tier_rollup                    45        0.27             0.27      0  
 8 contact_lead_id                      3        0.02             0.02      0  
 9 qal_id                               0        0               NA        NA  
10 priority                             0        0               NA        NA  
11 solution                             0        0               NA        NA  
12 solution_rollup                      0        0               NA        NA  
13 last_tactic_campaign_channel         0        0               NA        NA  
14 qal_cohort_date                      0        0               NA        NA  
Results

The results show that there are no blank string values in either acct_primary_site_function or acct_manufacturing_model. This confirms that missing information in these variables is recorded as NA rather than empty strings.

As the table comparison shows, the observed missingness differs substantially from the data dictionary for the site-function and manufacturing-model fields (roughly 40 percentage points higher than documented) and moderately for job title (about 9 points higher); the remaining fields match the documentation.

Veracity of the data

Some variables include placeholder categories such as “Not Enough Info Found” and “Unknown,” which reflect incomplete classification rather than true missing data. These will be treated as explicit categories in the analysis to evaluate whether incomplete enrichment affects lead progression.

#unique_labels <- lapply(leads %>% select(where(is.character)), function(x) sort(unique(x)))
#unique_labels

counts_not_enough_unknown <- leads %>%
  select(where(is.character)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  filter(!is.na(value)) %>%
  filter(str_detect(str_to_lower(value), "not.*enough|enough.*info|unknown")) %>%
  count(variable, value, sort = TRUE) %>%
  group_by(variable) %>%
  mutate(pct = round(100 * n / nrow(leads), 2)) %>%
  ungroup()

counts_not_enough_unknown
# A tibble: 4 × 4
  variable                   value                     n   pct
  <chr>                      <chr>                 <int> <dbl>
1 acct_primary_site_function Not Enough Info Found  1092  6.49
2 acct_manufacturing_model   Not Enough Info Found   826  4.91
3 acct_target_industry       Unknown                 182  1.08
4 acct_territory_rollup      Unknown                   6  0.04

In addition to true NA values, some variables contain placeholder categories like “Not Enough Info Found” and “Unknown.” These don’t represent simple data entry gaps, but rather cases where the system could not properly classify the account. Combined with the high percentage of NA values, this suggests that account-level information is not fully complete in this dataset. Since missingness itself may be meaningful, these categories will be treated explicitly in the analysis rather than removed.

Analyzing Missing Data

Missingness vs. Lead Progression (Success Rate by Product)

In this section, we test whether missing enrichment fields are associated with lower lead progression (i.e., whether missingness is random or acts as a signal of lead quality).

First, we restrict the dataset to leads with a known outcome (next_stage_c) so we can compute success rates reliably. We then create a binary target variable, success, where leads that reached SQL, SQO, or Won are labeled 1 and all other outcomes are labeled 0.

Next, we create indicator flags that capture whether key enrichment fields are missing: site function (site_missing), manufacturing model (mfg_missing), and job title (title_missing), plus a combined flag (site_or_mfg_missing) that identifies leads missing either site or manufacturing information.

Finally, we summarize the count of leads and the success rate for missing vs not-missing groups, broken out by solution_rollup (Mx, Qx, Ax), and visualize the results to compare progression rates across products and missingness conditions.

# Keep only rows with a known outcome, and create binary success target
leads_clean <- leads %>%
  filter(!is.na(next_stage_c)) %>%
  mutate(success = if_else(next_stage_c %in% c("SQL","SQO","Won"), 1, 0))

# Create missingness flags for the key enrichment fields (plus combined flag)
leads_clean_flags <- leads_clean %>%
  mutate(
    site_missing        = is.na(acct_primary_site_function),
    mfg_missing         = is.na(acct_manufacturing_model),
    title_missing       = is.na(contact_lead_title),
    site_or_mfg_missing = site_missing | mfg_missing
  )

# Reshape to long format and compute counts + success rates
missingness_results <- leads_clean_flags %>%
  pivot_longer(
    cols = c(site_missing, mfg_missing, site_or_mfg_missing, title_missing),
    names_to = "missing_type",
    values_to = "missing_flag"
  ) %>%
  group_by(solution_rollup, missing_type, missing_flag) %>%
  summarise(
    n = n(),
    success_rate = round(mean(success) * 100, 2),
    .groups = "drop"
  ) %>%
  arrange(solution_rollup, missing_type, desc(missing_flag))

plot_df <- missingness_results %>%
  filter(missing_type %in% c("site_or_mfg_missing", "title_missing")) %>%
  mutate(
    missing_flag = if_else(missing_flag, "Missing", "Not missing"),
    missing_type = recode(
      missing_type,
      "site_or_mfg_missing" = "Site OR Manufacturing Missing",
      "title_missing" = "Job Title Missing"
    )
  )

plot_df
# A tibble: 12 × 5
   solution_rollup missing_type                  missing_flag     n success_rate
   <chr>           <chr>                         <chr>        <int>        <dbl>
 1 Ax              Site OR Manufacturing Missing Missing         14         0   
 2 Ax              Site OR Manufacturing Missing Not missing     13        61.5 
 3 Ax              Job Title Missing             Missing         14        21.4 
 4 Ax              Job Title Missing             Not missing     13        38.5 
 5 Mx              Site OR Manufacturing Missing Missing       1360         0.88
 6 Mx              Site OR Manufacturing Missing Not missing   2765        18.8 
 7 Mx              Job Title Missing             Missing       1158         8.89
 8 Mx              Job Title Missing             Not missing   2967        14.5 
 9 Qx              Site OR Manufacturing Missing Missing       5291         1.23
10 Qx              Site OR Manufacturing Missing Not missing   6889        35.0 
11 Qx              Job Title Missing             Missing       5025        12.9 
12 Qx              Job Title Missing             Not missing   7155        25.5 
ggplot(plot_df, aes(x = solution_rollup, y = success_rate, fill = missing_flag)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = paste0("n=", n)),
            position = position_dodge(width = 0.9),
            vjust = -0.3, size = 3) +
  facet_wrap(~ missing_type) +
  labs(
    title = "Lead Progression Rate by Missing Enrichment",
    x = "Solution Rollup",
    y = "Success Rate (%)",
    fill = ""
  ) +
  theme_minimal()

Interpretation of Output

Missing enrichment is strongly linked to lower progression, especially for site function + manufacturing model.

  • MX: site_or_mfg_missing = TRUE → 0.88% (n=1,360) vs FALSE → 18.84% (n=2,765).
    title_missing = TRUE → 8.89% (n=1,158) vs FALSE → 14.49% (n=2,967).

  • QX: site_or_mfg_missing = TRUE → 1.23% (n=5,291) vs FALSE → 34.95% (n=6,889).
    title_missing = TRUE → 12.90% (n=5,025) vs FALSE → 25.51% (n=7,155).

  • AX: large differences but very small n (13–14) → interpret cautiously.

Conclusion: missingness is not random and should be kept as a signal (use missingness flags or explicit “Missing/Unknown” categories in modeling).
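One way to keep missingness as a signal, as recommended above, is to enter the flags directly as predictors. A minimal sketch using the leads_clean_flags object created earlier (the model specification here is illustrative, not the team's final model):

```r
# Baseline logistic regression: do the missingness flags alone
# predict lead progression?
fit <- glm(success ~ site_or_mfg_missing + title_missing,
           data = leads_clean_flags, family = binomial)
summary(fit)
```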

Find a pattern with missing data

MX Only: Where True Missingness Clusters
Composition of NA-Missing Leads by Channel, Tier, Territory, and Industry

In this section, we focus specifically on MX leads to understand whether missing enrichment fields follow a consistent pattern. We first restrict the data to records with a known outcome (next_stage_c) so success rates are comparable, then create a binary success indicator (SQL/SQO/Won = 1, otherwise = 0). Next, we define two missingness flags: site_or_mfg_missing (missing either site function or manufacturing model) and title_missing (missing job title).

Using these flags, we compare the composition of the missing vs not-missing groups across key categorical dimensions—marketing channel, account tier, territory, and target industry—by calculating within-group percentages. This helps identify where missingness is concentrated (e.g., specific channels or industries), and provides evidence that missing values may reflect systematic differences in lead source or profile rather than random data entry gaps.

mx_miss <- leads %>%
  filter(!is.na(next_stage_c)) %>%
  mutate(success = if_else(next_stage_c %in% c("SQL","SQO","Won"), 1, 0)) %>%
  filter(solution_rollup == "Mx") %>%
  mutate(
    site_or_mfg_missing = is.na(acct_primary_site_function) | is.na(acct_manufacturing_model),
    title_missing = is.na(contact_lead_title)
  )

# Pattern: NA group mix by channel (top 10)
mx_miss %>%
  group_by(site_or_mfg_missing, last_tactic_campaign_channel) %>%
  summarise(n = n(), .groups="drop") %>%
  group_by(site_or_mfg_missing) %>%
  mutate(pct = round(100 * n / sum(n), 2)) %>%
  arrange(site_or_mfg_missing, desc(pct)) %>%
  slice_head(n = 10)
# A tibble: 20 × 4
# Groups:   site_or_mfg_missing [2]
   site_or_mfg_missing last_tactic_campaign_channel     n   pct
   <lgl>               <chr>                        <int> <dbl>
 1 FALSE               Email                          774 28.0 
 2 FALSE               External Demand Gen            753 27.2 
 3 FALSE               Online Ads                     463 16.8 
 4 FALSE               SEO                            275  9.95
 5 FALSE               Direct/Inbound                 199  7.2 
 6 FALSE               Events                         137  4.95
 7 FALSE               Outbound Prospecting            82  2.97
 8 FALSE               Directory Listing               33  1.19
 9 FALSE               MISSING/BLANK                   23  0.83
10 FALSE               PR/AR/Social                    12  0.43
11 TRUE                Email                          368 27.1 
12 TRUE                Online Ads                     321 23.6 
13 TRUE                External Demand Gen            179 13.2 
14 TRUE                SEO                            177 13.0 
15 TRUE                Direct/Inbound                 134  9.85
16 TRUE                Directory Listing               71  5.22
17 TRUE                Events                          50  3.68
18 TRUE                Outbound Prospecting            29  2.13
19 TRUE                MISSING/BLANK                   14  1.03
20 TRUE                PR/AR/Social                     9  0.66
# Pattern: NA group mix by tier
mx_miss %>%
  group_by(site_or_mfg_missing, acct_tier_rollup) %>%
  summarise(n = n(), .groups="drop") %>%
  group_by(site_or_mfg_missing) %>%
  mutate(pct = round(100 * n / sum(n), 2)) %>%
  arrange(site_or_mfg_missing, desc(pct))
# A tibble: 9 × 4
# Groups:   site_or_mfg_missing [2]
  site_or_mfg_missing acct_tier_rollup     n   pct
  <lgl>               <chr>            <int> <dbl>
1 FALSE               Medium            1846 66.8 
2 FALSE               Small              459 16.6 
3 FALSE               Large              458 16.6 
4 FALSE               Other                2  0.07
5 TRUE                Medium             689 50.7 
6 TRUE                Small              448 32.9 
7 TRUE                Large              157 11.5 
8 TRUE                Other               57  4.19
9 TRUE                <NA>                 9  0.66
# Visualization
mx_miss %>%
  group_by(site_or_mfg_missing, last_tactic_campaign_channel) %>%
  summarise(n = n(), .groups="drop") %>%
  group_by(site_or_mfg_missing) %>%
  mutate(pct = n / sum(n)) %>%
  ggplot(aes(x = reorder(last_tactic_campaign_channel, pct), y = pct, fill = site_or_mfg_missing)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(
    title = "MX: Channel composition for leads with vs without Site/Manufacturing missing",
    x = "Last Tactic Campaign Channel",
    y = "Share within group",
    fill = "Site/Mfg Missing"
  ) +
  theme_minimal()

#  Pattern: NA group mix by territory (MX)
mx_miss %>%
  group_by(site_or_mfg_missing, acct_territory_rollup) %>%
  summarise(n = n(), .groups="drop") %>%
  group_by(site_or_mfg_missing) %>%
  mutate(pct = round(100 * n / sum(n), 2)) %>%
  arrange(site_or_mfg_missing, desc(pct))
# A tibble: 11 × 4
# Groups:   site_or_mfg_missing [2]
   site_or_mfg_missing acct_territory_rollup     n   pct
   <lgl>               <chr>                 <int> <dbl>
 1 FALSE               Americas               1523 55.1 
 2 FALSE               EMEA                    615 22.2 
 3 FALSE               APAC & Oceania          611 22.1 
 4 FALSE               Japan                    16  0.58
 5 TRUE                Americas                667 49.0 
 6 TRUE                EMEA                    388 28.5 
 7 TRUE                APAC & Oceania          263 19.3 
 8 TRUE                <NA>                     30  2.21
 9 TRUE                Japan                     9  0.66
10 TRUE                Unknown                   2  0.15
11 TRUE                Antarctica                1  0.07
# Pattern: NA group mix by target industry (MX) - top 10
mx_miss %>%
  group_by(site_or_mfg_missing, acct_target_industry) %>%
  summarise(n = n(), .groups="drop") %>%
  group_by(site_or_mfg_missing) %>%
  mutate(pct = round(100 * n / sum(n), 2)) %>%
  arrange(site_or_mfg_missing, desc(pct)) %>%
  slice_head(n = 10)
# A tibble: 11 × 4
# Groups:   site_or_mfg_missing [2]
   site_or_mfg_missing acct_target_industry     n   pct
   <lgl>               <chr>                <int> <dbl>
 1 FALSE               Pharma & BioTech      1520 55.0 
 2 FALSE               Medical Device         780 28.2 
 3 FALSE               Non-Life Science       369 13.4 
 4 FALSE               Blood & Biologics       86  3.11
 5 FALSE               Unknown                 10  0.36
 6 TRUE                Non-Life Science       693 51.0 
 7 TRUE                Pharma & BioTech       443 32.6 
 8 TRUE                Medical Device         151 11.1 
 9 TRUE                Unknown                 50  3.68
10 TRUE                Blood & Biologics       14  1.03
11 TRUE                <NA>                     9  0.66
Interpretation of the Output (MX)

Missing site/manufacturing enrichment in MX is not random. When this info is missing, leads are more likely to be Small tier (32.94% vs 16.60%) and more likely to come from digital/inbound channels (Online Ads, SEO, Direct/Inbound, Directory Listing), while being less associated with Email/External Demand Gen.

Missingness also varies by region and industry: the missing group is less concentrated in the Americas (49.04% vs 55.08%) and more in EMEA (28.53% vs 22.24%). The strongest pattern is industry—missing leads are much more likely to be Non-Life Science (50.96% vs 13.35%) and less likely to be Pharma & BioTech or Medical Device.

Conclusion: These missing values are not happening by accident. For MX leads, “missing site/manufacturing info” shows up more often in certain types of leads:

  • smaller accounts

  • leads coming from digital/inbound channels (online ads, SEO, direct/inbound, directory)

  • leads from EMEA more than the Americas

  • leads in Non-Life Science more than Pharma & BioTech or Medical Device

This matters because it means “missing” is telling us something about the lead itself and how it entered the system. In practice, missing enrichment can be treated like a warning sign that the lead may be lower quality or less well-qualified, or that the channel/source collects less complete information.
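As a quick check on this warning-sign framing, progression rates can be compared directly across the missingness flag. This is a hedged sketch, assuming `mx_miss` still carries `next_stage_c` from the original leads table (column names follow the document):

```r
# Sketch: progression rate by the enrichment-missingness flag.
# Assumes mx_miss retains next_stage_c; SQL/SQO/Won defines success as above.
mx_miss %>%
  mutate(success = next_stage_c %in% c("SQL", "SQO", "Won")) %>%
  group_by(site_or_mfg_missing) %>%
  summarise(
    n = n(),
    progression_rate = round(100 * mean(success), 2),
    .groups = "drop"
  )
```

If the missing group progresses at a visibly lower rate, the flag itself is a usable predictor rather than just a data-quality nuisance.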

MX Only: Where Low-Enrichment (“Low Info”) Clusters

Composition of Low-Info Leads Using NA + Placeholder Values

In this section, we expand the definition of “missing” enrichment beyond true NA values. For two key fields—acct_primary_site_function and acct_manufacturing_model—we create a Low Info indicator that captures both:

  • True missing values (NA), and

  • Placeholder values that represent incomplete enrichment (e.g., “Unknown”, “Not Enough Info Found”, “MISSING/BLANK”, or empty strings).

We then focus only on MX leads with a known outcome and compare the composition of the Low Info group vs. the non-Low-Info group across:

  • marketing channel (last_tactic_campaign_channel)

  • account tier (acct_tier_rollup)

  • territory (acct_territory_rollup)

  • target industry (acct_target_industry)

mx_lowinfo <- leads %>%
  filter(!is.na(next_stage_c)) %>%
  mutate(success = if_else(next_stage_c %in% c("SQL","SQO","Won"), 1, 0)) %>%
  filter(solution_rollup == "Mx") %>%
  mutate(
    # define "low info" as NA OR placeholders
    site_lowinfo =
      is.na(acct_primary_site_function) |
      str_to_lower(str_trim(acct_primary_site_function)) %in% c("unknown", "not enough info found", "missing/blank", ""),
    mfg_lowinfo =
      is.na(acct_manufacturing_model) |
      str_to_lower(str_trim(acct_manufacturing_model)) %in% c("unknown", "not enough info found", "missing/blank", ""),
    site_or_mfg_lowinfo = site_lowinfo | mfg_lowinfo
  )

# 1) Channel mix
mx_lowinfo %>%
  group_by(site_or_mfg_lowinfo, last_tactic_campaign_channel) %>%
  summarise(n = n(), .groups="drop") %>%
  group_by(site_or_mfg_lowinfo) %>%
  mutate(pct = round(100 * n / sum(n), 2)) %>%
  arrange(site_or_mfg_lowinfo, desc(pct)) %>%
  slice_head(n = 10)
# A tibble: 20 × 4
# Groups:   site_or_mfg_lowinfo [2]
   site_or_mfg_lowinfo last_tactic_campaign_channel     n   pct
   <lgl>               <chr>                        <int> <dbl>
 1 FALSE               External Demand Gen            684 28.4 
 2 FALSE               Email                          683 28.4 
 3 FALSE               Online Ads                     395 16.4 
 4 FALSE               SEO                            218  9.05
 5 FALSE               Direct/Inbound                 162  6.73
 6 FALSE               Events                         124  5.15
 7 FALSE               Outbound Prospecting            77  3.2 
 8 FALSE               Directory Listing               23  0.96
 9 FALSE               MISSING/BLANK                   18  0.75
10 FALSE               Referrals                       12  0.5 
11 TRUE                Email                          459 26.7 
12 TRUE                Online Ads                     389 22.7 
13 TRUE                External Demand Gen            248 14.4 
14 TRUE                SEO                            234 13.6 
15 TRUE                Direct/Inbound                 171  9.96
16 TRUE                Directory Listing               81  4.72
17 TRUE                Events                          63  3.67
18 TRUE                Outbound Prospecting            34  1.98
19 TRUE                MISSING/BLANK                   19  1.11
20 TRUE                PR/AR/Social                    11  0.64
# 2) Tier mix
mx_lowinfo %>%
  group_by(site_or_mfg_lowinfo, acct_tier_rollup) %>%
  summarise(n = n(), .groups="drop") %>%
  group_by(site_or_mfg_lowinfo) %>%
  mutate(pct = round(100 * n / sum(n), 2)) %>%
  arrange(site_or_mfg_lowinfo, desc(pct))
# A tibble: 9 × 4
# Groups:   site_or_mfg_lowinfo [2]
  site_or_mfg_lowinfo acct_tier_rollup     n   pct
  <lgl>               <chr>            <int> <dbl>
1 FALSE               Medium            1642 68.2 
2 FALSE               Small              386 16.0 
3 FALSE               Large              379 15.7 
4 FALSE               Other                1  0.04
5 TRUE                Medium             893 52.0 
6 TRUE                Small              521 30.3 
7 TRUE                Large              236 13.7 
8 TRUE                Other               58  3.38
9 TRUE                <NA>                 9  0.52
# 3) Territory mix
mx_lowinfo %>%
  group_by(site_or_mfg_lowinfo, acct_territory_rollup) %>%
  summarise(n = n(), .groups="drop") %>%
  group_by(site_or_mfg_lowinfo) %>%
  mutate(pct = round(100 * n / sum(n), 2)) %>%
  arrange(site_or_mfg_lowinfo, desc(pct))
# A tibble: 11 × 4
# Groups:   site_or_mfg_lowinfo [2]
   site_or_mfg_lowinfo acct_territory_rollup     n   pct
   <lgl>               <chr>                 <int> <dbl>
 1 FALSE               Americas               1304 54.2 
 2 FALSE               APAC & Oceania          549 22.8 
 3 FALSE               EMEA                    540 22.4 
 4 FALSE               Japan                    15  0.62
 5 TRUE                Americas                886 51.6 
 6 TRUE                EMEA                    463 27.0 
 7 TRUE                APAC & Oceania          325 18.9 
 8 TRUE                <NA>                     30  1.75
 9 TRUE                Japan                    10  0.58
10 TRUE                Unknown                   2  0.12
11 TRUE                Antarctica                1  0.06
# 4) Target industry mix (top 10)
mx_lowinfo %>%
  group_by(site_or_mfg_lowinfo, acct_target_industry) %>%
  summarise(n = n(), .groups="drop") %>%
  group_by(site_or_mfg_lowinfo) %>%
  mutate(pct = round(100 * n / sum(n), 2)) %>%
  arrange(site_or_mfg_lowinfo, desc(pct)) %>%
  slice_head(n = 10)
# A tibble: 11 × 4
# Groups:   site_or_mfg_lowinfo [2]
   site_or_mfg_lowinfo acct_target_industry     n   pct
   <lgl>               <chr>                <int> <dbl>
 1 FALSE               Pharma & BioTech      1400 58.1 
 2 FALSE               Medical Device         668 27.7 
 3 FALSE               Non-Life Science       253 10.5 
 4 FALSE               Blood & Biologics       80  3.32
 5 FALSE               Unknown                  7  0.29
 6 TRUE                Non-Life Science       809 47.1 
 7 TRUE                Pharma & BioTech       563 32.8 
 8 TRUE                Medical Device         263 15.3 
 9 TRUE                Unknown                 53  3.09
10 TRUE                Blood & Biologics       20  1.16
11 TRUE                <NA>                     9  0.52

Conclusion

For MX, “Low Info” enrichment is a meaningful signal. It is more common for:

  • smaller accounts

  • digital/inbound channels (Online Ads, SEO, Directory Listing)

  • EMEA compared to the Americas

  • Non-Life Science accounts (much higher share than in the complete-enrichment group)

Because these differences are large (especially by industry), we should not treat missing/placeholder values as random noise. Instead, we can keep a Low Info flag as a feature and/or combine missing/placeholder categories into a consistent Low Info level for modeling.
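Since `mx_lowinfo` already carries the `success` indicator defined in the code above, a short tabulation can confirm whether the Low Info flag actually tracks progression before it is promoted to a model feature. This is a sketch reusing the document's `mx_lowinfo` data frame:

```r
# Progression rate by the combined Low Info flag,
# using the success column created when mx_lowinfo was built.
mx_lowinfo %>%
  group_by(site_or_mfg_lowinfo) %>%
  summarise(
    n = n(),
    progression_rate = round(100 * mean(success), 2),
    .groups = "drop"
  )
```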

Contact Lead Title - Additional Cleaning Steps

Before any in-depth cleaning and standardization, we check the number of unique and missing values in the contact_lead_title field.

This field is free-form entry, so it contains a huge variation in job titles. In future modeling, this will create additional noise and difficulty in interpretation.

leads |>
  summarise(n = n(), n_unique = n_distinct(contact_lead_title),
            n_missing = sum(is.na(contact_lead_title) | contact_lead_title == ""))
# A tibble: 1 × 3
      n n_unique n_missing
  <int>    <int>     <int>
1 16815     5791      6408

Unique Contact Lead Titles

There are 5,791 unique contact lead titles within the dataset. The goal is to reduce the number of unique values to make grouping/clustering simpler for further analysis.

Pre-Processing Cleaning

Pre-processing steps are completed to reduce the number of unique values. This includes removing punctuation and special characters, numbers, and errant spacing, converting to lowercase, and removing stop words.

#remove special characters
leads <- leads |> mutate(
  contact_lead_title = contact_lead_title |>
    str_to_lower() |>
    str_replace_all("[^a-z\\s]", "") |>
    str_replace_all("\\s+", " ") |>
    str_trim()
)

stop_words_vec <- tidytext::stop_words$word

leads <- leads |>
  mutate(
    contact_lead_title = map_chr(
      contact_lead_title,
      ~ if (is.na(.x)) {
          NA_character_ #preserves NA values for now instead of turning to a string with value of NA
        } else {
          words <- str_split(.x, " ")[[1]]
          paste(words[!words %in% stop_words_vec], collapse = " ")
        }
    )
  )

After this cleaning, the number of unique values is still quite high, improving by only about 8% (445 values).

leads |>
  summarise(n = n(), n_unique = n_distinct(contact_lead_title),
            n_missing = sum(is.na(contact_lead_title) | contact_lead_title == ""))
# A tibble: 1 × 3
      n n_unique n_missing
  <int>    <int>     <int>
1 16815     5346      6460

Pre-Processing Cleaning Output: The number of unique values decreased from 5,791 to 5,346. Further cleaning and/or title grouping is needed to lower this value.

Common Contact Title Groupings

To further reduce the number of unique values, AI was leveraged to build a map of common abbreviations and typing errors. This value map was then applied to the data, replacing values wherever they occurred.

title_map <- c(
  # --- common misspellings / normalization ---
  "cheif" = "chief",
  "chif" = "chief",
  "cheif executive officer" = "chief executive officer",
  "chief exec officer" = "chief executive officer",
  "chief exec" = "chief executive officer",
  "chief executive" = "chief executive officer",
  "chief operating" = "chief operating officer",
  "chief financial" = "chief financial officer",
  "chief technology" = "chief technology officer",
  "chief information" = "chief information officer",
  "chief marketing" = "chief marketing officer",
  "chief revenue" = "chief revenue officer",
  "chief product" = "chief product officer",
  "chief people" = "chief people officer",
  "chief customer" = "chief customer officer",
  "chief legal" = "chief legal officer",
  "chief compliance" = "chief compliance officer",
  "chief risk" = "chief risk officer",
  "chief security" = "chief security officer",
  "chief data" = "chief data officer",
  "chief analytics" = "chief analytics officer",

  "manger" = "manager",
  "mngr" = "manager",
  "managr" = "manager",
  "dirctor" = "director",
  "directer" = "director",
  "v p" = "vice president",
  "svp" = "senior vice president",
  "evp" = "executive vice president",
  "avp" = "assistant vice president",
  "vpres" = "vice president",
  "vice pres" = "vice president",
  "vice-president" = "vice president",

  "asst" = "assistant",
  "assitant" = "assistant",
  "assisstant" = "assistant",
  "assoc" = "associate",
  "assc" = "associate",
  "sr" = "senior",
  "s r" = "senior",
  "snr" = "senior",
  "senr" = "senior",
  "jr" = "junior",
  "j r" = "junior",
  "jnr" = "junior",
  "lead" = "lead",
  "principal" = "principal",
  "prinicipal" = "principal",
  "pricipal" = "principal",
  "staff" = "staff",
  "intern" = "intern",
  "trainee" = "trainee",
  "apprentice" = "apprentice",

  # --- founders / ownership ---
  "co founder" = "cofounder",
  "co-founder" = "cofounder",
  "cofounder" = "cofounder",
  "founder" = "founder",
  "owner" = "owner",
  "proprietor" = "owner",
  "partner" = "partner",
  "managing partner" = "managing partner",

  # --- C-suite acronyms (expand) ---
  "ceo" = "chief executive officer",
  "coo" = "chief operating officer",
  "cfo" = "chief financial officer",
  "cto" = "chief technology officer",
  "cio" = "chief information officer",
  "cmo" = "chief marketing officer",
  "cro" = "chief revenue officer",
  "cpo" = "chief product officer",
  "cdo" = "chief data officer",              # sometimes "chief digital officer"—watch ambiguity
  "cso" = "chief security officer",          # sometimes "chief strategy officer"—watch ambiguity
  "chro" = "chief human resources officer",
  "clo" = "chief legal officer",
  "cco" = "chief compliance officer",        # sometimes "chief commercial officer"—watch ambiguity
  "ciso" = "chief information security officer",
  "cbo" = "chief business officer",
  "cxo" = "chief executive officer",

  # --- senior leadership / exec roles ---
  "pres" = "president",
  "president" = "president",
  "vice president" = "vice president",
  "senior vice president" = "senior vice president",
  "executive vice president" = "executive vice president",
  "managing director" = "managing director",
  "md" = "managing director",                # ambiguous in healthcare
  "gm" = "general manager",                  # ambiguous but common
  "general mgr" = "general manager",

  # --- management / leadership ---
  "dir" = "director",
  "director" = "director",
  "sr director" = "senior director",
  "senior director" = "senior director",
  "exec director" = "executive director",
  "executive director" = "executive director",

  "mgr" = "manager",
  "manager" = "manager",
  "sr mgr" = "senior manager",
  "senior manager" = "senior manager",
  "asst manager" = "assistant manager",
  "assistant manager" = "assistant manager",

  "supervisor" = "supervisor",
  "team lead" = "team lead",
  "tl" = "team lead",                        # can be noisy—watch ambiguity

  # --- functional common abbreviations ---
  "hr" = "human resources",
  "it" = "information technology",
  "pr" = "public relations",
  "qa" = "quality assurance",
  "qc" = "quality control",
  "ops" = "operations",
  "biz dev" = "business development",
  "bd" = "business development",             # ambiguous (can be “board” in some contexts)
  "sales rep" = "sales representative",
  "rep" = "representative",
  "sr rep" = "senior representative",

  # --- engineering / tech ---
  "swe" = "software engineer",
  "sdet" = "software development engineer in test",
  "devops" = "devops",
  "dev ops" = "devops",
  "secops" = "security operations",
  "data sci" = "data scientist",
  "ds" = "data scientist",                   # ambiguous
  "ml" = "machine learning",
  "ai" = "artificial intelligence",

  # --- product / project (high ambiguity) ---
  "pm" = "project manager",                  # could be product manager
  "pmo" = "project management office",
  "po" = "product owner",
  "scrum master" = "scrum master",

  # --- finance / accounting ---
  "acct" = "accountant",
  "accounting" = "accounting",
  "fp a" = "financial planning and analysis",
  "fpa" = "financial planning and analysis",
  "ap" = "accounts payable",
  "ar" = "accounts receivable",
  "cpa" = "certified public accountant",

  # --- legal / compliance / risk ---
  "gc" = "general counsel",
  "counsel" = "counsel",
  "attorney" = "attorney",
  "compliance" = "compliance",
  "risk" = "risk",
  "audit" = "audit",

  # --- customer / service / success ---
  "cs" = "customer success",                 # ambiguous (computer science)
  "csr" = "customer service representative",
  "support" = "support",
  "customer support" = "customer support",

  # --- misc common role words ---
  "admin" = "administrator",
  "administrator" = "administrator",
  "assistant" = "assistant",
  "coordinator" = "coordinator",
  "specialist" = "specialist",
  "analyst" = "analyst",
  "consultant" = "consultant",
  "engineer" = "engineer",
  "developer" = "developer"
)
replace_word <- function(x, from, to) {
  str_replace_all(x, regex(paste0("\\b", from, "\\b"), ignore_case = TRUE), to)
}
leads <- leads |>
  mutate(
    contact_lead_title = reduce(
      names(title_map),
      .init = contact_lead_title,
      .f = \(acc, k) replace_word(acc, k, title_map[[k]])
    ) |>
      str_replace_all("\\s+", " ") |>
      str_trim()
  )
leads |>
  summarise(n = n(), n_unique = n_distinct(contact_lead_title),
            n_missing = sum(is.na(contact_lead_title) | contact_lead_title == ""))
# A tibble: 1 × 3
      n n_unique n_missing
  <int>    <int>     <int>
1 16815     5244      6460

Contact Lead Title Grouping Output

This reduces the unique values by another ~100 (5,346 to 5,244). The remaining problem is that common words combine in many different orders and frequencies, so exact-title matching still leaves thousands of unique values.
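The combination problem can be illustrated with a small, self-contained example (the titles below are hypothetical, not taken from the data): four titles describing essentially the same role each count as a distinct value under exact matching, while word-level flags collapse them.

```r
library(stringr)

# Hypothetical titles illustrating the word-ordering/combination problem
titles <- c("quality manager",
            "manager quality assurance",
            "quality assurance manager",
            "senior quality assurance manager")

# Exact matching: every ordering is its own unique value
length(unique(titles))  # 4

# Word-level flags: all four titles set both the "quality" and "manager" flags
has_word <- function(x, w) str_detect(x, paste0("\\b", w, "\\b"))
sum(has_word(titles, "quality"))  # 4
sum(has_word(titles, "manager"))  # 4
```

This is the motivation for the word-occurrence indicators built in the next sections: they describe titles by which common words they contain rather than by their exact string.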

Contact Lead Title Word Cloud

A word cloud of the 50 most common words holds no surprises about which words appear most frequently.

#remove whitespace
par(mar = c(0,0,0,0), oma = c(0,0,0,0), xaxs = "i", yaxs = "i")

corp <- corpus(leads, text_field = "contact_lead_title")
toks <- tokens(corp)
dfm_mat <- dfm(toks)

textplot_wordcloud(dfm_mat, max_words = 50, max_size = 6, min_size = 0.5)

Word Cloud Output: Manager, quality, director, assurance, and senior are the most common words highlighted by the word cloud.

Most Common Word Data Frame

To account for the various combinations of words, a data frame of the most commonly occurring words is created. Only words that occur at least 5 times are retained. These words will later be used to create flag variables indicating whether each word occurred in a title.

top_features <- topfeatures(dfm_mat,10000)

top_features_df <- data.frame(
  feature = names(top_features),
  count = as.numeric(top_features))

count(top_features_df)
     n
1 2137
non_blank <- leads |>
  summarise(n = n() - sum(is.na(contact_lead_title) | contact_lead_title == "")) |>
  pull(n)
  
top_features_df <- top_features_df |> filter(count >= 5)
top_features_df <- top_features_df |> mutate(percentage_occured_non_blank = count / non_blank)
head(top_features_df,10)
      feature count percentage_occured_non_blank
1     quality  3606                   0.34823757
2     manager  3101                   0.29946886
3    director  1741                   0.16813134
4   assurance  1307                   0.12621922
5      senior  1187                   0.11463061
6  operations   776                   0.07493964
7    engineer   708                   0.06837277
8        head   650                   0.06277161
9     officer   585                   0.05649445
10 specialist   559                   0.05398358
tail(top_features_df,10)
        feature count percentage_occured_non_blank
352  technicien     5                 0.0004828585
353    indirect     5                 0.0004828585
354       dscsa     5                 0.0004828585
355  experience     5                 0.0004828585
356   secretary     5                 0.0004828585
357         sap     5                 0.0004828585
358     trainer     5                 0.0004828585
359 transfusion     5                 0.0004828585
360       clerk     5                 0.0004828585
361        lean     5                 0.0004828585

Common Words Data Frame Output

The output above shows the top 10 occurring words, as well as 10 words that occur only 5 times. In total, 361 words clear the 5-occurrence threshold, out of 2,137 unique words overall.

From the table above, ~34% of non-blank titles contain “quality” and ~30% contain “manager”.

Common Words Bi-Grams

Additionally, we can look at bi-grams (sequences of two consecutive words from the title field), which add some information about word order. However, the number of unique word combinations climbs back up greatly, so bi-grams will not be evaluated further.

toks_bigram <-tokens_ngrams(toks,n = 2)
dfm_bigram <-dfm(toks_bigram)

top_features_bi <- topfeatures(dfm_bigram,10000)

top_features_df_bi <- data.frame(
  feature = names(top_features_bi),
  count = as.numeric(top_features_bi))

count(top_features_df_bi)
     n
1 6693
top_features_df_bi <- top_features_df_bi |> filter(count >= 5)
top_features_df_bi <- top_features_df_bi |> mutate(percentage_occured_non_blank = count / non_blank)
head(top_features_df_bi,10)
                  feature count percentage_occured_non_blank
1       quality_assurance  1277                   0.12332207
2         manager_quality   486                   0.04693385
3        director_quality   395                   0.03814582
4          vice_president   377                   0.03640753
5         quality_manager   350                   0.03380010
6          senior_manager   277                   0.02675036
7      regulatory_affairs   260                   0.02510864
8       assurance_manager   243                   0.02346692
9         quality_control   229                   0.02211492
10 information_technology   192                   0.01854177

Clean Dataset

In this section, we create a cleaned version of the leads dataset that we can use for EDA (and later modeling) without removing rows.

Steps:

  • Keep all original columns

  • Standardize “missing-like” text in categorical fields by converting blanks and placeholder values (e.g., “Unknown”, “Not Enough Info Found”, “MISSING/BLANK”) into true NA.

  • Use a recipes pipeline to replace NA values in categorical predictor columns only with a consistent category level called “Low Info” so incomplete enrichment is explicit and rows are retained.

  • Create indicator flags (site_lowinfo, mfg_lowinfo, title_lowinfo, site_or_mfg_lowinfo) that capture whether key enrichment fields ended up as “Low Info”.

  • We do not modify next_stage_c in the recipe because it represents the outcome; any missing outcomes remain missing and are handled separately when computing success rates.

# Drop rows with missing target/outcome
leads_model <- leads %>%
  filter(!is.na(next_stage_c))

# Convert placeholder text + blanks to NA (for character/factor columns)
to_na_lowinfo <- function(x) {
  x <- str_trim(x)
  x <- na_if(x, "")
  x <- if_else(
    str_to_lower(x) %in% c("missing/blank", "unknown", "not enough info found"),
    NA_character_,
    x
  )
  x
}

rec_clean_only <- recipe(~ ., data = leads_model) %>%
  # Keep IDs as-is (do not clean/encode them as predictors)
  update_role(qal_id, contact_lead_id, new_role = "id") %>%
  
  # Keep outcome column as-is (do not clean it as a predictor)
  update_role(next_stage_c, new_role = "outcome") %>%
  
  # Clean categorical predictors ONLY (excludes id + outcome roles)
  step_mutate_at(all_nominal_predictors(), fn = to_na_lowinfo) %>%
  
  # Replace NA in categorical predictors with a consistent level
  step_unknown(all_nominal_predictors(), new_level = "Low Info") %>%
  
  # Create flags AFTER "Low Info" exists
  step_mutate(
    site_lowinfo        = acct_primary_site_function == "Low Info",
    mfg_lowinfo         = acct_manufacturing_model   == "Low Info",
    title_lowinfo       = contact_lead_title         == "Low Info",
    site_or_mfg_lowinfo = site_lowinfo | mfg_lowinfo
  )

# Apply recipe
prep_clean_only <- prep(rec_clean_only, training = leads_model, retain = TRUE)
leads_cleaned_only <- bake(prep_clean_only, new_data = NULL)

# Quick checks
dim(leads)
[1] 16815    14
dim(leads_cleaned_only)
[1] 16332    18
colSums(is.na(leads_cleaned_only))
  acct_primary_site_function     acct_manufacturing_model 
                           0                            0 
        acct_target_industry           contact_lead_title 
                           0                            0 
                      qal_id              contact_lead_id 
                           0                            2 
                next_stage_c                     priority 
                           0                            0 
       acct_territory_rollup             acct_tier_rollup 
                           0                            0 
                    solution              solution_rollup 
                           0                            0 
last_tactic_campaign_channel              qal_cohort_date 
                           0                            0 
                site_lowinfo                  mfg_lowinfo 
                           0                            0 
               title_lowinfo          site_or_mfg_lowinfo 
                           0                            0 
table(leads_cleaned_only$site_or_mfg_lowinfo, useNA = "ifany")

FALSE  TRUE 
 7859  8473 
  • Once the recipe has completed the data cleaning, the last step is merging the one-hot encoded title-word data back into the main cleaned data frame.
#change from factor before creating corpus/tokens
leads_cleaned_only <- leads_cleaned_only %>%
  mutate(contact_lead_title = as.character(contact_lead_title))

#create corpus and tokens 
corp <- corpus(
  leads_cleaned_only,
  text_field = "contact_lead_title")

toks <- tokens(corp)

toks_onehot <- tokens_select(
  toks,
  pattern = top_features_df$feature)

#create one hot encoded matrix
dfm_onehot <- dfm(toks_onehot)
dfm_onehot_bin <- dfm_weight(dfm_onehot, scheme = "boolean")
df_encoded <- convert(dfm_onehot_bin, to = "data.frame")

#join back together
leads_cleaned_only <-cbind(leads_cleaned_only, df_encoded[,-1])

#check duplicated columns for target variable binary conversion
colnames(leads_cleaned_only)[duplicated(colnames(leads_cleaned_only))]
[1] "solution"
dup_cols <- colnames(leads_cleaned_only)[colnames(leads_cleaned_only) %in% colnames(leads_cleaned_only)[duplicated(colnames(leads_cleaned_only))]]
#dup_cols

#view duplicate columns
#leads_cleaned_only[, dup_cols]

#drop solution.1
leads_cleaned_final <- leads_cleaned_only[, colnames(leads_cleaned_only) != "solution.1"]
#head(leads_cleaned_final)

#check for duplicates again
colnames(leads_cleaned_final)[duplicated(colnames(leads_cleaned_final))]
character(0)
#now convert target to binary
leads_cleaned_final <- leads_cleaned_final %>%
  mutate(
    next_stage_target = factor(
      if_else(next_stage_c %in% c("SQL", "SQO", "Won"), 1, 0)
    )
  )

#view dataframe
#head(leads_cleaned_final)

Interpretation (Clean Dataset checks)

  • Shape: The full dataset (leads) contains 16,815 rows and 14 columns.
    For modeling, we drop rows with missing outcome (next_stage_c), resulting in 16,332 rows.
    After applying the recipe, the cleaned modeling dataset keeps all 16,332 rows and grows to 379 columns: the original 14, 4 low-information flags (site_lowinfo, mfg_lowinfo, title_lowinfo, site_or_mfg_lowinfo), and 361 word-occurrence indicators.
    When converting the target variable next_stage_c to binary (1/0), a duplicate column name error was returned. We checked which column was duplicated, found it was ‘solution’, and removed the duplicate. Converting the target variable to binary then succeeded.

  • Missing values after recipe (colSums(is.na())):

    • acct_primary_site_function, acct_manufacturing_model, contact_lead_title now have 0 NA because NAs + placeholder values were standardized and replaced with the factor level “Low Info”.
    • next_stage_c has 0 NA in the cleaned modeling dataset because rows with missing outcomes were removed before the recipe (filter(!is.na(next_stage_c))).
    • contact_lead_id still has 2 NA (ID column kept as-is and not imputed).
  • Low-info flag (table(site_or_mfg_lowinfo)):

    • FALSE = 7,859: both site + manufacturing enrichment are present (not “Low Info”)
    • TRUE = 8,473: at least one of site or manufacturing is “Low Info”

Key takeaway: After restricting to leads with known outcomes, low-enrichment (“Low Info”) values remain very common and should be treated as signal rather than noise. We standardize missing/placeholder values into a consistent “Low Info” category for categorical predictors and create indicator flags (e.g., site_or_mfg_lowinfo, title_lowinfo) so downstream models can learn the relationship between enrichment quality and lead progression.
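With leads_cleaned_final in hand, the same enrichment-quality question can be checked directly against the binary target. This is a sketch using the columns created above:

```r
# Progression rate by the combined Low Info flag on the cleaned modeling data.
# next_stage_target is the binary factor created above; "1" marks SQL/SQO/Won.
leads_cleaned_final %>%
  group_by(site_or_mfg_lowinfo) %>%
  summarise(
    n = n(),
    progression_rate = round(100 * mean(next_stage_target == "1"), 2),
    .groups = "drop"
  )
```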

Clustering EDA

Clustering reveals multi-dimensional lead segments that simple cross-tabulations cannot surface. Where a bar chart shows one variable at a time (e.g., industry vs. conversion), clustering operates across all dimensions simultaneously — discovering combinations like “Small tier + EMEA + Medical Device + SEO channel” that form coherent, high-converting archetypes invisible in univariate analysis.

We perform four clustering analyses, each targeting a different question:

  1. Account Profile Clustering — What natural account archetypes exist, and how do they convert?
  2. Title-Based Role Clustering — Can we collapse 361 title word indicators into interpretable role families?
  3. MX-Specific Lead Profile Clustering — What distinct Mx lead profiles exist, ranked by conversion?
  4. Success Profile (ICP) Discovery — What do winning Mx leads look like?

Shared Setup: Create Success Variable and Site Function Groupings

# Set up cache directory for expensive clustering computations
cache_dir <- "EDA-Master-Control_cache/clustering"
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)

# Fix duplicate column names from cbind() of one-hot encoded title words
# (cbind can produce duplicates if any title word matches an existing column)
dup_cols <- duplicated(names(leads_cleaned_only))
if (any(dup_cols)) {
  cat("Removing", sum(dup_cols), "duplicate columns\n")
  leads_cleaned_only <- leads_cleaned_only[, !dup_cols]
}
Removing 1 duplicate columns
# Create binary success variable on the cleaned dataset
leads_clustered <- leads_cleaned_only %>%
  mutate(
    success = if_else(next_stage_c %in% c("SQL", "SQO", "Won"), 1, 0),
    # Group 78+ site function levels into 5 categories
    # (based on case_description.txt suggested groupings)
    site_function_group = case_when(
      acct_primary_site_function %in% c(
        "Oral solid dosage (OSD)", "Sterile injectables",
        "Small-molecule API", "Biologic API",
        "Cell therapies (CAR-T, NK)", "Generics manufacturing",
        "Drug discovery & development", "Nutraceutical manufacturing"
      ) ~ "Active Pharma Mfg",
      acct_primary_site_function %in% c(
        "Implantable devices", "In-vitro diagnostic (IVD) kits",
        "Diagnostic Equipment", "Surgical instruments",
        "Single-use disposables"
      ) ~ "Med Device Mfg",
      acct_primary_site_function %in% c(
        "Non-manufacturing organization", "Low Info"
      ) ~ "Non-Mfg / Low Info",
      acct_primary_site_function %in% c(
        "Food and beverage", "Cosmetics manufacturing",
        "Specialty Chemical Manufacturing", "Secondary packaging",
        "Lab Instruments"
      ) ~ "Other Mfg",
      TRUE ~ "Other Mfg"
    ),
    site_function_group = factor(site_function_group)
  )

Analysis 1: Account Profile Clustering (Gower + PAM)

We cluster all leads on account-level categorical features to discover natural account archetypes. PAM (Partitioning Around Medoids) with Gower distance is ideal for purely categorical data — medoids are real data points, making clusters directly interpretable.

Prepare Features

acct_df <- leads_clustered %>%
  select(acct_target_industry, acct_manufacturing_model,
         site_function_group, acct_tier_rollup,
         acct_territory_rollup, site_or_mfg_lowinfo) %>%
  mutate(across(everything(), as.factor))

dim(acct_df)
[1] 16332     6

Compute Gower Distance and Optimal k

set.seed(42)

cache_file <- file.path(cache_dir, "gower_acct.rds")
if (file.exists(cache_file)) {
  cached <- readRDS(cache_file)
  gower_dist <- cached$gower_dist
  sil_widths <- cached$sil_widths
  cat("Loaded gower_dist and sil_widths from cache\n")
} else {
  # Gower distance handles mixed/categorical data natively
  gower_dist <- daisy(acct_df, metric = "gower")

  # Search for optimal k via silhouette width
  sil_widths <- sapply(2:10, function(k) {
    pam_fit <- pam(gower_dist, k = k, diss = TRUE)
    pam_fit$silinfo$avg.width
  })

  saveRDS(list(gower_dist = gower_dist, sil_widths = sil_widths), cache_file)
  cat("Computed and cached gower_dist and sil_widths\n")
}
Loaded gower_dist and sil_widths from cache
sil_df <- data.frame(k = 2:10, avg_silhouette = sil_widths)

ggplot(sil_df, aes(x = k, y = avg_silhouette)) +
  geom_line() +
  geom_point(size = 2) +
  geom_point(data = sil_df %>% filter(avg_silhouette == max(avg_silhouette)),
             color = "red", size = 4) +
  labs(title = "Account Clustering: Optimal k by Silhouette Width",
       x = "Number of Clusters (k)", y = "Average Silhouette Width") +
  theme_minimal()

best_k_acct <- sil_df$k[which.max(sil_df$avg_silhouette)]
cat("Optimal k:", best_k_acct, "with avg silhouette:", max(sil_df$avg_silhouette))
Optimal k: 2 with avg silhouette: 0.4302417

The silhouette curve compares candidate cluster counts: the higher the average silhouette width, the better the separation between groups. The optimum here is k = 2, with an average width of ~0.43. This indicates the data is not perfectly separated but forms a reasonably strong grouping: leads naturally split into two main clusters, and k = 2 gives separation without excessive overlap.
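For reference, a point's silhouette width is s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is its mean distance to its own cluster and b(i) its mean distance to the nearest other cluster, so values near 1 mean clean assignment and values near 0 mean the point sits between clusters. A minimal sketch on toy 1-D data:

```r
library(cluster)

# Two well-separated 1-D groups (toy data)
x  <- c(1, 1.2, 0.9, 10, 10.3, 9.8)
cl <- c(1, 1, 1, 2, 2, 2)

# silhouette() takes a cluster assignment and a distance object
sil <- silhouette(cl, dist(x))
summary(sil)$avg.width  # close to 1: tight, well-separated clusters
```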

Fit Final PAM Model

cache_file <- file.path(cache_dir, "pam_acct.rds")
if (file.exists(cache_file)) {
  pam_acct <- readRDS(cache_file)
  cat("Loaded pam_acct from cache\n")
} else {
  pam_acct <- pam(gower_dist, k = best_k_acct, diss = TRUE)
  saveRDS(pam_acct, cache_file)
  cat("Computed and cached pam_acct\n")
}
Loaded pam_acct from cache
# Append cluster labels
leads_clustered$account_cluster <- factor(pam_acct$clustering)

# Silhouette plot
fviz_silhouette(pam_acct) +
  labs(title = "Account Cluster Silhouette Plot") +
  theme_minimal()
  cluster size ave.sil.width
1       1 7016          0.36
2       2 9316          0.48

This plot shows how well each individual observation fits its assigned cluster. Most accounts clearly belong to one cluster or the other, with few poorly assigned observations (negative silhouette widths) in either cluster. The clusters are therefore stable, and the grouping works well: most companies fit naturally into one of the two profiles.

MCA Biplot with Cluster Overlay

# MCA is the categorical analogue of PCA
mca_acct <- MCA(acct_df, graph = FALSE)

# Extract individual coordinates and color by cluster
mca_ind <- data.frame(mca_acct$ind$coord[, 1:2])
mca_ind$cluster <- leads_clustered$account_cluster

ggplot(mca_ind, aes(x = Dim.1, y = Dim.2, color = cluster)) +
  geom_point(alpha = 0.3, size = 0.8) +
  stat_ellipse(level = 0.7, linewidth = 1) +
  labs(title = "Account Clusters in MCA Space",
       x = paste0("Dim 1 (", round(mca_acct$eig[1, 2], 1), "%)"),
       y = paste0("Dim 2 (", round(mca_acct$eig[2, 2], 1), "%)"),
       color = "Cluster") +
  theme_minimal()

Multiple Correspondence Analysis (MCA) reduces the categorical variables into a two-dimensional space. Each point on the plot is an account/lead, positioned by the similarity of its features (industry, manufacturing type, etc.) and colored by cluster assignment. Leads that appear close together share similar characteristics; leads that are far apart differ more. As shown, Cluster 1 is tightly grouped, indicating a clear, focused profile, while Cluster 2 is more spread out across the MCA space, reflecting a wider mix of regions, company types, and incomplete information.

Key Takeaways:

  • There is structure to the data

  • There is one strong, cohesive lead segment

  • Another grouping is broader and less well-defined

Cluster Composition Heatmap

# Calculate proportion of each category level within each cluster
comp_long <- leads_clustered %>%
  select(account_cluster, acct_target_industry, acct_manufacturing_model,
         site_function_group, acct_tier_rollup, acct_territory_rollup) %>%
  pivot_longer(-account_cluster, names_to = "variable", values_to = "level") %>%
  count(account_cluster, variable, level) %>%
  group_by(account_cluster, variable) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Show top levels per variable for readability
top_levels <- comp_long %>%
  group_by(variable, level) %>%
  summarise(total_n = sum(n), .groups = "drop") %>%
  group_by(variable) %>%
  slice_max(total_n, n = 6) %>%
  pull(level)

comp_filtered <- comp_long %>%
  filter(level %in% top_levels) %>%
  mutate(label = paste0(variable, ": ", level))

ggplot(comp_filtered, aes(x = account_cluster, y = label, fill = prop)) +
  geom_tile(color = "white") +
  geom_text(aes(label = scales::percent(prop, accuracy = 1)), size = 2.5) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Account Cluster Composition",
       x = "Cluster", y = "", fill = "Proportion") +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 8))

This heatmap shows how feature composition differs between the two clusters, revealing two account archetypes:

  • Cluster 1 (“Core Pharma”): Americas-dominant (57%), Pharma & BioTech (57%), In-House manufacturing (62%), Active Pharma Mfg (31%), Medium tier (60%).

  • Cluster 2 (“Diverse / Low-Info”): More geographically diverse (EMEA 27%, APAC 13%), Medium tier (54%) but more Small accounts (20% vs Cluster 1), higher Non-Mfg/Low Info (61%), more Low Info manufacturing model (60%).

  • Cluster 1 looks more ideal for pharma leads and the other, Cluster 2, is more mixed and less focused.

Conversion Rate by Account Cluster

acct_conv <- leads_clustered %>%
  group_by(account_cluster, solution_rollup) %>%
  summarise(
    n = n(),
    conversion = mean(success) * 100,
    .groups = "drop"
  )

ggplot(acct_conv, aes(x = account_cluster, y = conversion, fill = solution_rollup)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = paste0(round(conversion, 1), "%\nn=", n)),
            position = position_dodge(width = 0.9), vjust = -0.3, size = 3) +
  geom_hline(yintercept = 12.7, linetype = "dashed", color = "orange", linewidth = 0.8) +
  geom_hline(yintercept = 19.7, linetype = "dashed", color = "blue", linewidth = 0.8) +
  annotate("text", x = 0.5, y = 13.5, label = "Mx avg (12.7%)", color = "orange", hjust = 0, size = 3) +
  annotate("text", x = 0.5, y = 20.5, label = "Qx avg (19.7%)", color = "blue", hjust = 0, size = 3) +
  labs(title = "Lead Conversion Rate by Account Cluster",
       subtitle = "Dashed lines show overall Mx and Qx averages",
       x = "Account Cluster", y = "Conversion Rate (%)", fill = "Product") +
  theme_minimal()

The graph shows the conversion rate by cluster and product (MX, QX, AX).

Key conversion finding: Cluster 1 (Core Pharma) has an MX conversion rate of 17.7% vs. 7.2% for Cluster 2, and a QX rate of 31.5% vs. 13.1%. The Core Pharma archetype converts at roughly 2.5x the rate of the Diverse/Low-Info archetype.

  • Successful customers are far more likely to come from Cluster 1: Americas-based Pharma/BioTech or medical device companies with real manufacturing operations.

  • This is a major targeting signal: winning customers follow a clear “ideal customer profile”.

Cluster Size Summary

cluster_summary <- leads_clustered %>%
  group_by(account_cluster) %>%
  summarise(
    n = n(),
    pct_of_total = round(100 * n() / nrow(leads_clustered), 1),
    mx_conversion = round(mean(success[solution_rollup == "Mx"]) * 100, 2),
    qx_conversion = round(mean(success[solution_rollup == "Qx"]) * 100, 2),
    pct_lowinfo = round(mean(site_or_mfg_lowinfo == TRUE) * 100, 1),
    .groups = "drop"
  )

cluster_summary
# A tibble: 2 × 6
  account_cluster     n pct_of_total mx_conversion qx_conversion pct_lowinfo
  <fct>           <int>        <dbl>         <dbl>         <dbl>       <dbl>
1 1                7016           43          17.7          31.5           0
2 2                9316           57           7.2          13.1          91

Medoid Profiles (Cluster Archetypes)

# The medoid is the most representative lead in each cluster
medoid_rows <- pam_acct$id.med
medoid_profiles <- leads_clustered[medoid_rows, ] %>%
  select(account_cluster, acct_target_industry, acct_manufacturing_model,
         site_function_group, acct_tier_rollup, acct_territory_rollup,
         site_or_mfg_lowinfo)

medoid_profiles
      account_cluster acct_target_industry acct_manufacturing_model
16301               1     Pharma & BioTech                 In-House
143                 2     Non-Life Science                 Low Info
      site_function_group acct_tier_rollup acct_territory_rollup
16301           Other Mfg           Medium              Americas
143    Non-Mfg / Low Info           Medium              Americas
      site_or_mfg_lowinfo
16301               FALSE
143                  TRUE

Business Translation

Each cluster represents a distinct account archetype. Compare the conversion rates above to the MX baseline of 12.7%. Clusters with above-average conversion contain the account profiles that show a strong signal for MX targeting. Clusters with very low conversion (especially those dominated by “Low Info” or “Non-Mfg”) represent targeting waste.


Analysis 2: Title-Based Role Clustering (Jaccard + Hierarchical)

We cluster the 361 binary title word indicators to discover natural “role families” among contacts. Jaccard distance is ideal for binary presence/absence data because it ignores shared zeros — two titles are not considered similar simply because they both lack the word “pharmacist.”
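A minimal sketch of why Jaccard suits sparse binary flags (toy vectors, not the real 361-column indicator matrix): two titles that share none of each other's words get distance 1, even though they share many zeros, whereas a simple-matching distance would call them mostly similar.

```r
# Two sparse binary title vectors over a 10-word vocabulary (toy example)
a <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)  # e.g. "quality manager"
b <- c(0, 0, 1, 1, 0, 0, 0, 0, 0, 0)  # e.g. "software engineer"

# Jaccard distance: 1 - |intersection| / |union|; shared zeros are ignored
jaccard <- function(x, y) 1 - sum(x & y) / sum(x | y)

# Simple matching counts shared zeros as agreement
simple_matching <- function(x, y) mean(x != y)

jaccard(a, b)          # 1.0 -> correctly "completely different"
simple_matching(a, b)  # 0.4 -> misleadingly "mostly similar"
```

With hundreds of vocabulary columns the zero-sharing problem only gets worse, which is why Jaccard (rather than Euclidean or simple matching) is used on the title matrix.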

Prepare Title Data

# Identify title word indicator columns (binary flags from one-hot encoding)
title_word_cols <- top_features_df$feature

# Filter to rows with actual titles (not Low Info)
title_df <- leads_clustered %>%
  filter(title_lowinfo == FALSE) %>%
  select(all_of(title_word_cols))

# Convert to numeric matrix for distance computation
title_mat <- as.matrix(title_df)
storage.mode(title_mat) <- "integer"

cat("Title matrix dimensions:", dim(title_mat), "\n")
Title matrix dimensions: 10103 361 
cat("Rows with titles:", nrow(title_mat))
Rows with titles: 10103

Compute Jaccard Distance and Hierarchical Clustering

set.seed(42)

cache_file <- file.path(cache_dir, "jacc_title.rds")
if (file.exists(cache_file)) {
  cached <- readRDS(cache_file)
  jacc_dist <- cached$jacc_dist
  hc_title <- cached$hc_title
  sample_idx <- cached$sample_idx
  title_mat_sample <- title_mat[sample_idx, ]
  cat("Loaded jacc_dist, hc_title, and sample_idx from cache\n")
} else {
  # If dataset is large, sample for distance computation
  if (nrow(title_mat) > 5000) {
    sample_idx <- sample(nrow(title_mat), 5000)
    title_mat_sample <- title_mat[sample_idx, ]
    cat("Sampled", nrow(title_mat_sample), "rows for distance computation\n")
  } else {
    title_mat_sample <- title_mat
    sample_idx <- seq_len(nrow(title_mat))
  }

  # Jaccard distance for binary presence/absence data
  jacc_dist <- proxy::dist(title_mat_sample, method = "Jaccard")

  # Ward's hierarchical clustering
  hc_title <- hclust(jacc_dist, method = "ward.D2")

  saveRDS(list(jacc_dist = jacc_dist, hc_title = hc_title, sample_idx = sample_idx), cache_file)
  cat("Computed and cached jacc_dist, hc_title, and sample_idx\n")
}
Loaded jacc_dist, hc_title, and sample_idx from cache

Silhouette Analysis for Optimal k

sil_title <- sapply(3:12, function(k) {
  cl <- cutree(hc_title, k = k)
  mean(silhouette(cl, jacc_dist)[, "sil_width"])
})

sil_title_df <- data.frame(k = 3:12, avg_silhouette = sil_title)

ggplot(sil_title_df, aes(x = k, y = avg_silhouette)) +
  geom_line() +
  geom_point(size = 2) +
  geom_point(data = sil_title_df %>% filter(avg_silhouette == max(avg_silhouette)),
             color = "red", size = 4) +
  labs(title = "Title Role Clustering: Optimal k by Silhouette Width",
       x = "Number of Clusters (k)", y = "Average Silhouette Width") +
  theme_minimal()

best_k_title <- sil_title_df$k[which.max(sil_title_df$avg_silhouette)]
cat("Optimal k:", best_k_title, "with avg silhouette:", max(sil_title_df$avg_silhouette))
Optimal k: 12 with avg silhouette: 0.07545932

As in Analysis 1, the silhouette curve compares candidate cluster counts; higher average widths indicate better separation. Here the scores are very low (roughly 0.03–0.075), peaking at k = 12. This indicates weak cluster structure in job titles: titles are highly heterogeneous and do not form tight groups.

Dendrogram

dend <- as.dendrogram(hc_title)
dend <- color_branches(dend, k = best_k_title)

plot(dend, leaflab = "none",
     main = paste0("Title Role Family Dendrogram (k=", best_k_title, ")"))

The dendrogram shows the hierarchical clustering of job-title role families: which roles are most similar, how groups merge, and how distinct the clusters are. Lower joins connect more similar roles, while higher joins connect more dissimilar ones. The gradual higher-level merging of subgroups into broader categories shows that role families exist but are not sharply distinct — the tree merges gradually rather than splitting into clean, hard groups.
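The number of role families comes from where the tree is cut: `cutree()` slices the same dendrogram at different heights. A minimal sketch on a toy binary matrix (base R's `"binary"` distance is the Jaccard-style dissimilarity):

```r
# Toy binary matrix: 6 "titles" over a 4-word vocabulary
m <- rbind(
  c(1, 1, 0, 0), c(1, 1, 0, 0), c(1, 0, 0, 0),
  c(0, 0, 1, 1), c(0, 0, 1, 1), c(0, 0, 0, 1)
)

# "binary" distance ignores shared zeros, like Jaccard
hc <- hclust(dist(m, method = "binary"), method = "ward.D2")

# Cutting the same tree at different depths yields coarser or finer families
cutree(hc, k = 2)  # two broad families
cutree(hc, k = 3)  # a finer split
```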

Cluster-Word Heatmap (What Each Role Family Means)

title_clusters_sample <- cutree(hc_title, k = best_k_title)

# Profile each cluster by word proportion
cluster_word_props <- title_mat_sample %>%
  as.data.frame() %>%
  mutate(cluster = factor(title_clusters_sample)) %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean), .groups = "drop")

# Select top 30 most discriminating words (highest variance across clusters)
word_variance <- apply(cluster_word_props[, -1], 2, var)
top_30_words <- names(sort(word_variance, decreasing = TRUE))[1:30]

heatmap_data <- cluster_word_props %>%
  select(cluster, all_of(top_30_words)) %>%
  tibble::column_to_rownames("cluster") %>%
  as.matrix()

pheatmap(t(heatmap_data),
         main = "Role Family Profiles: Top 30 Discriminating Title Words",
         cluster_rows = TRUE, cluster_cols = FALSE,
         display_numbers = TRUE, number_format = "%.2f",
         fontsize_number = 7, fontsize_row = 9,
         color = colorRampPalette(c("white", "steelblue"))(50))

Similar to the cluster composition heatmap in Analysis 1, this heatmap shows what distinguishes the clusters — here, the proportion of titles in each role family that contain each of the top 30 discriminating words. High values in a column identify the words that define that family (for example, a quality/assurance-heavy family versus a project/success-oriented one). This is what lets us interpret each cluster as a job function before comparing how the families convert.

Assign Clusters to Full Dataset

# Assign cluster labels to sampled rows, then assign remaining rows
# to nearest cluster centroid via Jaccard distance to cluster means

# Replace any NAs in title_mat with 0 (missing indicator = "not present")
title_mat[is.na(title_mat)] <- 0L

# Compute cluster centroids (mean binary vectors)
centroids <- title_mat_sample %>%
  as.data.frame() %>%
  mutate(cluster = title_clusters_sample) %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean), .groups = "drop")

centroid_mat <- as.matrix(centroids[, -1])

# Assign all title rows to nearest centroid
assign_cluster <- function(row, centroids) {
  # Jaccard-like distance: 1 - (intersection / union) for binary vs continuous
  dists <- apply(centroids, 1, function(c) {
    intersection <- sum(pmin(row, c, na.rm = TRUE))
    union_val <- sum(pmax(row, c, na.rm = TRUE))
    if (is.na(union_val) || union_val == 0) return(1)
    1 - intersection / union_val
  })
  which.min(dists)
}

all_title_clusters <- apply(title_mat, 1, assign_cluster, centroids = centroid_mat)

# Add to main dataset
leads_clustered$title_cluster <- NA_integer_
title_rows <- which(leads_clustered$title_lowinfo == FALSE)
leads_clustered$title_cluster[title_rows] <- all_title_clusters
leads_clustered$title_cluster <- factor(leads_clustered$title_cluster)

table(leads_clustered$title_cluster, useNA = "ifany")

   1    2    3    4    5    6    7    8    9   10   11 <NA> 
 581  719 3382  378 1416  409  762  668  730  475  583 6229 

Conversion Rate by Role Cluster

title_conv <- leads_clustered %>%
  filter(!is.na(title_cluster)) %>%
  group_by(title_cluster, solution_rollup) %>%
  summarise(
    n = n(),
    conversion = mean(success) * 100,
    .groups = "drop"
  )

ggplot(title_conv, aes(x = title_cluster, y = conversion, fill = solution_rollup)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = paste0(round(conversion, 1), "%\nn=", n)),
            position = position_dodge(width = 0.9), vjust = -0.3, size = 3) +
  geom_hline(yintercept = 12.7, linetype = "dashed", color = "orange") +
  geom_hline(yintercept = 19.7, linetype = "dashed", color = "blue") +
  labs(title = "Lead Conversion by Title Role Family",
       subtitle = "Dashed lines: Mx avg (orange) and Qx avg (blue)",
       x = "Role Cluster", y = "Conversion Rate (%)", fill = "Product") +
  theme_minimal()

This graph shows which role families convert most often, by product. Some role families are much more likely to convert than others: targeting the right role matters as much as targeting the right company.

Key conversion finding: Cluster 3 (quality-focused) has the highest MX conversion at 15.2% with n=734 — above the 12.7% MX average. Cluster 5 also shows MX at 11.7% (n=539). Clusters 8 and 9 (IT-focused) have near-zero MX conversion. The quality/assurance role family is the strongest MX conversion signal from titles.

t-SNE Visualization

set.seed(42)

# t-SNE on the sampled title data
tsne_result <- Rtsne(jacc_dist, is_distance = TRUE, perplexity = 30, dims = 2)

tsne_df <- data.frame(
  x = tsne_result$Y[, 1],
  y = tsne_result$Y[, 2],
  cluster = factor(title_clusters_sample)
)

ggplot(tsne_df, aes(x = x, y = y, color = cluster)) +
  geom_point(alpha = 0.4, size = 0.8) +
  labs(title = "Title Role Families in t-SNE Space",
       x = "t-SNE 1", y = "t-SNE 2", color = "Role Cluster") +
  theme_minimal()

This plot visualizes the sampled leads in two dimensions based on their title-word features. There is some spatial separation between clusters, but also significant overlap, consistent with the low silhouette scores above: the title clusters have weak structure and do not form tight groups.

Role Family Summary Table

# For each cluster, show top 5 defining words and conversion rates
role_summary <- leads_clustered %>%
  filter(!is.na(title_cluster)) %>%
  group_by(title_cluster) %>%
  summarise(
    n = n(),
    mx_conversion = round(mean(success[solution_rollup == "Mx"]) * 100, 2),
    qx_conversion = round(mean(success[solution_rollup == "Qx"]) * 100, 2),
    .groups = "drop"
  )

# Add top words from the heatmap data
top_words_per_cluster <- cluster_word_props %>%
  pivot_longer(-cluster, names_to = "word", values_to = "prop") %>%
  group_by(cluster) %>%
  slice_max(prop, n = 5) %>%
  summarise(top_words = paste(word, collapse = ", "), .groups = "drop")

role_summary <- role_summary %>%
  left_join(top_words_per_cluster, by = c("title_cluster" = "cluster"))

role_summary
# A tibble: 11 × 5
   title_cluster     n mx_conversion qx_conversion top_words                    
   <fct>         <int>         <dbl>         <dbl> <chr>                        
 1 1               581         14.1          19.1  president, vice, quality, su…
 2 2               719         15.2          25.6  engineer, quality, senior, s…
 3 3              3382         28.2          37.6  quality, manager, assurance,…
 4 4               378          8.33         21.3  regulatory, affairs, quality…
 5 5              1416         11.7          22.9  director, quality, success, …
 6 6               409          6.67         12.8  quality, assurance, speciali…
 7 7               762         10.6          20.2  senior, manager, quality, di…
 8 8               668          0             0.51 manager, project, success, s…
 9 9               730          4.71          6.88 success, manager, developmen…
10 10              475         18.5          26.4  officer, chief, executive, s…
11 11              583          9.17         26.5  head, quality, operations, s…

Business Translation

Each role cluster represents a distinct job function family. The conversion rates above reveal which role families are most receptive to MX vs. QX. The sales team can use these role families as targeting criteria — for example, if “Operations/Manufacturing Directors” convert at 20%+ for MX while “Quality Specialists” convert at only 8%, that directly informs outbound prospecting priorities.


Analysis 3: MX-Specific Lead Profile Clustering (MCA + k-means)

This is the most directly actionable analysis. We focus exclusively on MX leads and cluster them using all available features to find distinct lead profiles, then rank profiles by conversion rate.

MCA (Multiple Correspondence Analysis) is used for dimensionality reduction — the categorical equivalent of PCA — followed by k-means on the resulting continuous dimensions.
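A minimal sketch of the MCA-then-k-means pipeline on toy data (hypothetical levels; FactoMineR is the same package used below): MCA turns factor columns into continuous coordinates, and k-means then partitions leads in that space.

```r
library(FactoMineR)

set.seed(42)

# Toy categorical lead table (made-up levels)
toy <- data.frame(
  industry = factor(sample(c("Pharma", "Device", "Other"), 60, replace = TRUE)),
  tier     = factor(sample(c("Small", "Medium", "Large"), 60, replace = TRUE)),
  region   = factor(sample(c("Americas", "EMEA"), 60, replace = TRUE))
)

# Step 1: MCA projects the categorical rows into continuous dimensions
mca_fit <- MCA(toy, graph = FALSE)
coords  <- mca_fit$ind$coord[, 1:2]  # keep the first 2 dimensions for the sketch

# Step 2: ordinary k-means on the MCA scores
km <- kmeans(coords, centers = 3, nstart = 25)
table(km$cluster)  # cluster sizes
```

The real analysis below follows the same two steps but retains enough dimensions to cover ~70% of the inertia and chooses k by silhouette width rather than fixing it at 3.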

Prepare MX Data

mx_leads <- leads_clustered %>%
  filter(solution_rollup == "Mx") %>%
  mutate(
    # Group manufacturing model to top 8 + Other + Low Info
    mfg_model_group = fct_lump_n(acct_manufacturing_model, n = 8,
                                  other_level = "Other"),
    # Group priority into 4 levels
    priority_group = case_when(
      str_detect(priority, "P1 - (Video|Live|Webinar) Demo|P1 - Website Pricing") ~ "P1 High-Intent",
      str_detect(priority, "P1 - Contact Us|P1 - MQL|P1 - Discount") ~ "P1 Standard",
      priority == "Priority 1" ~ "P1 Standard",
      priority == "Priority 2" ~ "P2",
      TRUE ~ "Low/No Priority"
    ),
    priority_group = factor(priority_group)
  )

# Select clustering features
mx_cluster_df <- mx_leads %>%
  select(acct_target_industry, mfg_model_group, site_function_group,
         acct_tier_rollup, acct_territory_rollup,
         last_tactic_campaign_channel, priority_group,
         site_or_mfg_lowinfo, title_lowinfo) %>%
  mutate(across(everything(), as.factor))

cat("Mx leads for clustering:", nrow(mx_cluster_df), "\n")
Mx leads for clustering: 4125 
cat("Features:", ncol(mx_cluster_df))
Features: 9

MCA Dimensionality Reduction

mca_mx <- MCA(mx_cluster_df, graph = FALSE)

# Screeplot: how many dimensions to retain?
fviz_screeplot(mca_mx, addlabels = TRUE, ncp = 15) +
  labs(title = "MCA Screeplot: Mx Leads") +
  theme_minimal()

# Cumulative inertia
cum_inertia <- cumsum(mca_mx$eig[, 2])
# If 70% inertia is never reached, use all available dimensions
if (any(cum_inertia >= 70)) {
  n_dim <- min(which(cum_inertia >= 70))
} else {
  n_dim <- length(cum_inertia)
}
cat("Dimensions to retain (>=70% inertia):", n_dim, "\n")
Dimensions to retain (>=70% inertia): 22 
cat("Cumulative inertia at", n_dim, "dims:", round(cum_inertia[min(n_dim, length(cum_inertia))], 1), "%")
Cumulative inertia at 22 dims: 71.4 %

The screeplot displays the percentage of variance (inertia) explained by each MCA dimension; the higher the variance, the more important the dimension. Dimension 1 explains only 8.6%, so no single dimension dominates, and 22 dimensions are needed to reach ~70% cumulative inertia. The drop-off after the first few dimensions still suggests the strongest segmentation patterns are concentrated in a low-dimensional subspace, even though many dimensions are retained for clustering.

MCA Variable Biplot

fviz_mca_var(mca_mx, repel = TRUE, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")) +
  labs(title = "MCA Variable Map: Which Category Levels Co-Occur for Mx Leads") +
  theme_minimal()

The MCA variable biplot shows how the levels of the categorical variables relate to each other in the reduced-dimensional space. Each point represents one level (e.g., industry = Pharma, region = Americas, a manufacturing type). Levels that plot close together tend to co-occur in the same leads; levels that plot far apart rarely do. Points positioned far from the origin act as strong differentiators between lead types, while centrally located levels contribute little to segmentation.

Optimal k for k-means on MCA Scores

# Cap at available dimensions (MCA may produce fewer than n_dim)
n_avail <- ncol(mca_mx$ind$coord)
mca_coords <- mca_mx$ind$coord[, 1:min(n_dim, n_avail, 15)]

# Silhouette method
fviz_nbclust(mca_coords, kmeans, method = "silhouette", k.max = 8) +
  labs(title = "Mx Lead Clustering: Optimal k (Silhouette)") +
  theme_minimal()

As in the earlier silhouette analyses, the curve compares candidate cluster counts. The best score is at k = 6, with an average silhouette width of about 0.35, and the overall range (roughly 0.0–0.35) is low. As with the title clustering in Analysis 2, the structure is weak: lead segments exist, but they overlap.

Fit k-means

set.seed(42)

cache_file <- file.path(cache_dir, "kmeans_mx.rds")
if (file.exists(cache_file)) {
  cached <- readRDS(cache_file)
  sil_km <- cached$sil_km
  km_mx <- cached$km_mx
  cat("Loaded km_mx from cache\n")
} else {
  # Use silhouette-optimal k
  sil_km <- sapply(2:8, function(k) {
    km <- kmeans(mca_coords, centers = k, nstart = 25)
    mean(silhouette(km$cluster, dist(mca_coords))[, "sil_width"])
  })

  best_k_mx <- (2:8)[which.max(sil_km)]
  km_mx <- kmeans(mca_coords, centers = best_k_mx, nstart = 25)

  saveRDS(list(sil_km = sil_km, km_mx = km_mx), cache_file)
  cat("Computed and cached km_mx\n")
}
Loaded km_mx from cache
best_k_mx <- (2:8)[which.max(sil_km)]
cat("Optimal k for Mx clustering:", best_k_mx, "\n")
Optimal k for Mx clustering: 6 
mx_leads$mx_cluster <- factor(km_mx$cluster)

Mx Cluster Individuals in MCA Space

mca_plot_df <- data.frame(
  Dim1 = mca_coords[, 1],
  Dim2 = mca_coords[, 2],
  cluster = mx_leads$mx_cluster
)

ggplot(mca_plot_df, aes(x = Dim1, y = Dim2, color = cluster)) +
  geom_point(alpha = 0.4, size = 1) +
  stat_ellipse(level = 0.7, linewidth = 1) +
  labs(title = "Mx Lead Clusters in MCA Space",
       x = paste0("Dim 1 (", round(mca_mx$eig[1, 2], 1), "%)"),
       y = paste0("Dim 2 (", round(mca_mx$eig[2, 2], 1), "%)"),
       color = "Mx Cluster") +
  theme_minimal()

As with the account clusters in Analysis 1, the MCA space reduces the categorical variables to two dimensions: each point is a lead, positioned by attribute similarity and colored by cluster assignment. Several clusters visibly overlap, supporting the finding that the clustering is weak, without strong distinctions between clusters.

Mx Conversion Rate by Cluster

mx_conv <- mx_leads %>%
  group_by(mx_cluster) %>%
  summarise(
    n = n(),
    pct_of_mx = round(100 * n() / nrow(mx_leads), 1),
    conversion = round(mean(success) * 100, 2),
    .groups = "drop"
  ) %>%
  arrange(desc(conversion))

ggplot(mx_conv, aes(x = reorder(mx_cluster, -conversion), y = conversion,
                     fill = conversion > 12.7)) +
  geom_col() +
  geom_text(aes(label = paste0(conversion, "%\nn=", n, " (", pct_of_mx, "%)")),
            vjust = -0.3, size = 3.5) +
  geom_hline(yintercept = 12.7, linetype = "dashed", color = "red", linewidth = 0.8) +
  annotate("text", x = 0.5, y = 13.5, label = "Mx avg: 12.7%", color = "red",
           hjust = 0, size = 3) +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "grey60"),
                    guide = "none") +
  labs(title = "Mx Lead Conversion Rate by Cluster",
       subtitle = "Blue = above Mx average, Grey = below",
       x = "Mx Cluster (ordered by conversion)", y = "Conversion Rate (%)") +
  theme_minimal()

mx_conv
# A tibble: 6 × 4
  mx_cluster     n pct_of_mx conversion
  <fct>      <int>     <dbl>      <dbl>
1 3            680      16.5      20.4 
2 1            186       4.5      18.8 
3 5           1455      35.3      15.3 
4 6            930      22.5      12.4 
5 2             53       1.3      11.3 
6 4            821      19.9       1.95

This graph shows the MX conversion rate for each lead cluster; some clusters are far more likely to convert than others. Key conversion finding: Cluster 4 represents ~20% of all MX leads but converts at only 2% — these 821 leads are likely targeting waste. Meanwhile, Clusters 3, 1, and 5 (56.3% of leads) convert at 15–20%, well above the 12.7% average, with Cluster 3 highest at 20.4%. Profiling what distinguishes Cluster 4 from these high-converting clusters would directly inform targeting improvements. Cluster 5's profile (Pharma & BioTech, In-House, Active Pharma Mfg) and its 15.3% conversion also line up with Cluster 1 (Core Pharma, 17.7% MX) from Analysis 1, suggesting both analyses surface the same high-value archetype.

Mx Cluster Profile Summary

# For each cluster, show the dominant category in each feature
mx_profiles <- mx_leads %>%
  group_by(mx_cluster) %>%
  summarise(
    n = n(),
    conversion = round(mean(success) * 100, 2),
    top_industry = names(sort(table(acct_target_industry), decreasing = TRUE))[1],
    top_mfg_model = names(sort(table(mfg_model_group), decreasing = TRUE))[1],
    top_site_fn = names(sort(table(site_function_group), decreasing = TRUE))[1],
    top_tier = names(sort(table(acct_tier_rollup), decreasing = TRUE))[1],
    top_territory = names(sort(table(acct_territory_rollup), decreasing = TRUE))[1],
    top_channel = names(sort(table(last_tactic_campaign_channel), decreasing = TRUE))[1],
    top_priority = names(sort(table(priority_group), decreasing = TRUE))[1],
    pct_lowinfo = round(mean(site_or_mfg_lowinfo == TRUE) * 100, 1),
    .groups = "drop"
  ) %>%
  arrange(desc(conversion))

mx_profiles
# A tibble: 6 × 11
  mx_cluster     n conversion top_industry    top_mfg_model top_site_fn top_tier
  <fct>      <int>      <dbl> <chr>           <chr>         <chr>       <chr>   
1 3            680      20.4  Medical Device  In-House      Med Device… Medium  
2 1            186      18.8  Pharma & BioTe… In-House      Non-Mfg / … Medium  
3 5           1455      15.3  Pharma & BioTe… In-House      Active Pha… Medium  
4 6            930      12.4  Non-Life Scien… Low Info      Non-Mfg / … Medium  
5 2             53      11.3  Low Info        Low Info      Non-Mfg / … Other   
6 4            821       1.95 Pharma & BioTe… Low Info      Non-Mfg / … Medium  
# ℹ 4 more variables: top_territory <chr>, top_channel <chr>,
#   top_priority <chr>, pct_lowinfo <dbl>

Business Translation

The clusters above represent distinct MX lead archetypes. The highest-converting cluster(s) reveal the “golden MX profile” — the combination of industry, company type, channel, and priority that sales should prioritize. Low-converting clusters (especially those with high % Low Info) represent leads where current targeting is inefficient. If MasterControl shifts MX outreach toward the highest-converting cluster, the overall MX progression rate can move toward the 16-18% target.


Analysis 4: Success Profile Clustering — ICP Discovery (Gower + PAM)

We cluster only successful MX leads (SQL/SQO/Won) to reverse-engineer the Ideal Customer Profile. With only ~533 successes, PAM is ideal — the medoid of each cluster is a real winning lead that serves as a concrete archetype.

Prepare Successful Mx Leads

mx_success <- leads_clustered %>%
  filter(solution_rollup == "Mx",
         next_stage_c %in% c("SQL", "SQO", "Won")) %>%
  mutate(
    mfg_model_group = fct_lump_n(acct_manufacturing_model, n = 8,
                                  other_level = "Other"),
    priority_group = case_when(
      str_detect(priority, "P1 - (Video|Live|Webinar) Demo|P1 - Website Pricing") ~ "P1 High-Intent",
      str_detect(priority, "P1 - Contact Us|P1 - MQL|P1 - Discount") ~ "P1 Standard",
      priority == "Priority 1" ~ "P1 Standard",
      priority == "Priority 2" ~ "P2",
      TRUE ~ "Low/No Priority"
    )
  )

cat("Successful Mx leads:", nrow(mx_success))
Successful Mx leads: 533

Select ICP Features

# Top title word indicators for ICP analysis
top_title_words <- c("quality", "manager", "director", "operations",
                     "manufacturing", "regulatory", "vice", "president",
                     "engineer", "senior", "compliance", "systems",
                     "specialist", "chief", "lead")

# Only include title words that exist in the dataset
available_title_words <- intersect(top_title_words, names(mx_success))

icp_df <- mx_success %>%
  select(acct_target_industry, mfg_model_group, site_function_group,
         acct_tier_rollup, acct_territory_rollup,
         last_tactic_campaign_channel, priority_group,
         all_of(available_title_words)) %>%
  mutate(across(where(is.character), as.factor),
         across(where(is.numeric), as.factor))

cat("ICP features:", ncol(icp_df), "\n")
ICP features: 22 
cat("Successful Mx leads:", nrow(icp_df))
Successful Mx leads: 533

Gower Distance + Optimal k

cache_file <- file.path(cache_dir, "gower_icp.rds")
if (file.exists(cache_file)) {
  cached <- readRDS(cache_file)
  gower_icp <- cached$gower_icp
  sil_icp <- cached$sil_icp
  cat("Loaded gower_icp and sil_icp from cache\n")
} else {
  gower_icp <- daisy(icp_df, metric = "gower")

  sil_icp <- sapply(2:6, function(k) {
    pam_fit <- pam(gower_icp, k = k, diss = TRUE)
    pam_fit$silinfo$avg.width
  })

  saveRDS(list(gower_icp = gower_icp, sil_icp = sil_icp), cache_file)
  cat("Computed and cached gower_icp and sil_icp\n")
}
Loaded gower_icp and sil_icp from cache
sil_icp_df <- data.frame(k = 2:6, avg_silhouette = sil_icp)

ggplot(sil_icp_df, aes(x = k, y = avg_silhouette)) +
  geom_line() +
  geom_point(size = 2) +
  geom_point(data = sil_icp_df %>% filter(avg_silhouette == max(avg_silhouette)),
             color = "red", size = 4) +
  labs(title = "ICP Clustering: Optimal k by Silhouette Width",
       x = "Number of Clusters (k)", y = "Average Silhouette Width") +
  theme_minimal()

best_k_icp <- sil_icp_df$k[which.max(sil_icp_df$avg_silhouette)]
cat("Optimal k for ICP:", best_k_icp, "with avg silhouette:", max(sil_icp_df$avg_silhouette))
Optimal k for ICP: 2 with avg silhouette: 0.09993155

As in our other silhouette analyses, the average silhouette widths are uniformly low (all at or below ~0.1). The successful leads do not form distinct internal clusters; they are relatively homogeneous, which makes sense: successful leads share a common profile. The optimal k is 2, with an average silhouette width of ~0.1.
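To make the interpretation of these low scores concrete, here is a minimal toy sketch (synthetic 1-D data, not the lead dataset) contrasting the average silhouette width of well-separated clusters against heavily overlapping ones, the latter behaving like the ~0.1 values seen here:

```r
library(cluster)

set.seed(7)
# Two toy 1-D datasets: well-separated clusters vs. heavily overlapping ones
sep_x  <- c(rnorm(30, mean = 0, sd = 0.3), rnorm(30, mean = 5, sd = 0.3))
ovl_x  <- c(rnorm(30, mean = 0, sd = 1.0), rnorm(30, mean = 0.5, sd = 1.0))
labels <- rep(1:2, each = 30)

# Average silhouette width: near 1 for clean separation, near 0 for overlap
sil_sep <- mean(silhouette(labels, dist(sep_x))[, "sil_width"])
sil_ovl <- mean(silhouette(labels, dist(ovl_x))[, "sil_width"])
```

With separation, `sil_sep` lands near 1; with overlap, `sil_ovl` sits near 0, which is why a ~0.1 width signals one homogeneous population rather than distinct subgroups.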

Fit PAM and Extract ICPs

pam_icp <- pam(gower_icp, k = best_k_icp, diss = TRUE)

mx_success$icp_cluster <- factor(pam_icp$clustering)

# Silhouette plot
fviz_silhouette(pam_icp) +
  labs(title = "ICP Cluster Silhouette Plot") +
  theme_minimal()
  cluster size ave.sil.width
1       1  366          0.09
2       2  167          0.13

This plot shows how well each individual data point fits its assigned cluster. At k = 2, the average silhouette width is ~0.1 (cluster 1: 0.09, cluster 2: 0.13), which again indicates weak separation between clusters. Successful leads do not naturally split into clearly distinct subgroups; they appear relatively homogeneous, consistent with Analysis 1's finding that successful leads share a consistent customer profile.

ICP Archetype Table (Medoid Profiles)

# Each medoid IS a real winning Mx lead — the concrete archetype
medoid_icp_rows <- pam_icp$id.med
icp_archetypes <- mx_success[medoid_icp_rows, ] %>%
  select(icp_cluster, acct_target_industry, mfg_model_group, site_function_group,
         acct_tier_rollup, acct_territory_rollup,
         last_tactic_campaign_channel, priority_group,
         contact_lead_title)

icp_archetypes
    icp_cluster acct_target_industry mfg_model_group site_function_group
457           1       Medical Device        In-House           Other Mfg
527           2     Pharma & BioTech            CDMO   Active Pharma Mfg
    acct_tier_rollup acct_territory_rollup last_tactic_campaign_channel
457           Medium              Americas                   Online Ads
527           Medium              Americas                          SEO
    priority_group    contact_lead_title
457    P1 Standard              Low Info
527    P1 Standard research scientist ii

ICP Cluster Composition vs. Overall MX Population

# Compare: what % of successful Mx leads are in each category vs. all Mx leads?
mx_all_with_cluster <- leads_clustered %>%
  filter(solution_rollup == "Mx") %>%
  mutate(
    mfg_model_group = fct_lump_n(acct_manufacturing_model, n = 8, other_level = "Other")
  )

compare_vars <- c("acct_target_industry", "site_function_group", "acct_tier_rollup",
                   "acct_territory_rollup")

comparison_data <- bind_rows(
  mx_all_with_cluster %>%
    select(all_of(compare_vars)) %>%
    pivot_longer(everything(), names_to = "variable", values_to = "level") %>%
    count(variable, level) %>%
    group_by(variable) %>%
    mutate(prop = n / sum(n), group = "All Mx Leads") %>%
    ungroup(),
  mx_success %>%
    select(all_of(compare_vars)) %>%
    pivot_longer(everything(), names_to = "variable", values_to = "level") %>%
    count(variable, level) %>%
    group_by(variable) %>%
    mutate(prop = n / sum(n), group = "Successful Mx Leads") %>%
    ungroup()
)

ggplot(comparison_data, aes(x = level, y = prop, fill = group)) +
  geom_col(position = "dodge") +
  facet_wrap(~ variable, scales = "free_x") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Successful Mx Leads vs. All Mx Leads: Feature Distribution",
       subtitle = "Over-representation in 'Successful' group = targeting signal",
       x = "", y = "Proportion", fill = "") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

This comparison shows how frequently each attribute appears among successful MX leads versus the overall MX lead population. The goal is to identify characteristics that are over-represented among successes (associated with conversion) and those that are under-represented (less likely to convert).

  • Americas territory accounts for 53% of all MX leads but 62% of successful MX leads. Leads from the Americas convert more often than expected.

  • APAC and Oceania account for 21% of all MX leads but only 14% of successful MX leads. Companies in these regions convert less often.

  • The Medical Device industry and the Small tier each make up about 22% of all MX leads but 27% and 29% of successful MX leads, respectively; both are modestly over-represented among successes.

  • Large-tier organizations have very low representation among successes. These companies are less likely to convert.

  • Manufacturing (site function) accounts for 21% of all MX leads but 35% of successful MX leads. This is a strong positive targeting signal.

  • Non-Manufacturing and Low Info companies make up 47% of all MX leads but only 27% of successful MX leads. Despite being the largest segment (a share close to that of the Americas territory), they convert at roughly half the expected rate. This is a strong negative signal.

The faceted bar plots display, for each characteristic, the proportion of all MX leads next to the proportion of successful MX leads, making the over- and under-representation described above visible at a glance.

ICP Cluster Summary

icp_summary <- mx_success %>%
  group_by(icp_cluster) %>%
  summarise(
    n = n(),
    pct_of_winners = round(100 * n() / nrow(mx_success), 1),
    top_industry = names(sort(table(acct_target_industry), decreasing = TRUE))[1],
    top_mfg = names(sort(table(mfg_model_group), decreasing = TRUE))[1],
    top_site_fn = names(sort(table(site_function_group), decreasing = TRUE))[1],
    top_tier = names(sort(table(acct_tier_rollup), decreasing = TRUE))[1],
    top_territory = names(sort(table(acct_territory_rollup), decreasing = TRUE))[1],
    top_channel = names(sort(table(last_tactic_campaign_channel), decreasing = TRUE))[1],
    .groups = "drop"
  )

icp_summary
# A tibble: 2 × 9
  icp_cluster     n pct_of_winners top_industry     top_mfg top_site_fn top_tier
  <fct>       <int>          <dbl> <chr>            <chr>   <chr>       <chr>   
1 1             366           68.7 Medical Device   In-Hou… Other Mfg   Medium  
2 2             167           31.3 Pharma & BioTech CDMO    Active Pha… Medium  
# ℹ 2 more variables: top_territory <chr>, top_channel <chr>

ICP summary: The ideal MX customer is an Americas-based, small-to-medium, Pharma/BioTech or Medical Device company with an actual manufacturing function (not Non-Mfg/Low Info). The Non-Mfg/Low Info segment is the single biggest drag on MX conversion — it’s 47% of all MX leads but significantly under-represented among successes.

Business Translation

Each ICP cluster is a data-driven Ideal Customer Profile for MX. The medoid archetype gives the sales team a concrete example: “This exact type of lead, with this title, at this type of company, arriving via this channel, converted.” The comparison chart shows which attributes are over-represented among winners relative to all MX leads — these are the attributes to prioritize in targeting. If 60% of current MX leads match no ICP cluster, that quantifies the targeting waste and the opportunity for improvement.
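Quantifying how many current MX leads resemble an ICP could be done by computing each lead's Gower distance to the nearest PAM medoid and applying a cutoff. A minimal sketch on toy categorical data (the frame, the choice of rows 1-2 as medoids, and the 0.34 threshold are all illustrative assumptions, not values from this analysis):

```r
library(cluster)

set.seed(42)
# Toy all-categorical frame standing in for the real ICP feature set
icp_like <- data.frame(
  industry  = factor(sample(c("MedDev", "Pharma", "NonLS"), 50, replace = TRUE)),
  territory = factor(sample(c("Americas", "APAC"),          50, replace = TRUE)),
  tier      = factor(sample(c("Small", "Medium", "Large"),  50, replace = TRUE))
)
medoids <- icp_like[1:2, ]  # pretend rows 1-2 are the two PAM medoids

# Gower distance from every lead to each medoid; a lead "matches" an ICP
# if its nearest medoid is within a threshold (0.34 = at most one
# mismatched attribute out of three -- purely an illustrative cutoff)
d        <- as.matrix(daisy(rbind(medoids, icp_like), metric = "gower"))
nearest  <- apply(d[-(1:2), 1:2, drop = FALSE], 1, min)
icp_rate <- mean(nearest <= 0.34)
```

On the real data, `1 - icp_rate` would estimate the share of MX leads matching no ICP cluster, i.e., the targeting waste referred to above; the threshold would need to be tuned against observed conversion.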

Results Overview Summary

Data Problems:

  • Missing data is not random and follows a pattern; ‘Low Info’ records reflect incomplete enrichment or lead-source characteristics and should be treated as predictive signals rather than removed from the dataset.

  • Missing site/manufacturing information was found to come most likely from Small-tier accounts, which contradicts our Ideal Customer Profile cluster analysis indicating that small-to-medium tiers perform best. This will be analyzed further during modeling to gain firmer confirmation.

  • High variability in job titles created noise; standardizing titles and using one-hot encoding improved feature clarity and reduced job-title complexity for modeling.

  • We group these leads together and treat them as a signal rather than excluding them from the analysis.

  • As stated in the analysis, the ‘Missing/Low Info’ pattern tells us something about the lead itself and how it entered the system. In practice, missing enrichment can be treated as a warning sign that the lead may be lower quality or less well qualified, or that the channel/source collects less complete information.
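The title-standardization step mentioned above can be sketched as lowercasing, mapping missing titles to the ‘Low Info’ bucket, and flagging keyword presence as 0/1 indicator columns. A minimal illustration on hypothetical titles (the example strings and the three keywords are assumptions, not the full keyword list used in the analysis):

```r
library(dplyr)
library(stringr)

# Hypothetical raw titles showing the standardize-then-flag approach
titles <- tibble(contact_lead_title = c("Sr. Quality Manager",
                                        "DIRECTOR OF QUALITY",
                                        "quality engineer ii",
                                        NA))

title_flags <- titles %>%
  mutate(
    # Lowercase and send missing titles to the 'low info' bucket
    title_clean = str_to_lower(coalesce(contact_lead_title, "low info")),
    # One 0/1 indicator column per keyword (one-hot-style flags)
    quality  = as.integer(str_detect(title_clean, "quality")),
    director = as.integer(str_detect(title_clean, "director")),
    manager  = as.integer(str_detect(title_clean, "manager"))
  )
```

Keyword flags like these are what the `available_title_words` columns in the ICP feature selection represent.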

Strong Relationship Identified:

  • Industry alignment and manufacturing site function show the strongest relationships with MX conversion, particularly for Pharma/BioTech and Medical Device organizations with operational manufacturing relevance.

  • The low conversion rate of ‘Low Info’ groups makes them a strong negative indicator.

Structural Patterns from Clustering:

  • Successful leads are relatively homogeneous (low silhouette score of ~0.1), suggesting a single consistent ideal customer profile (ICP) rather than multiple distinct ICP subgroups.

  • The ideal MX customer profile is clear: Americas-based, small-to-medium, Pharma/BioTech or Medical Device companies.

Impact on Analytics Approach:

  • Targeting the right job title matters as much as targeting the right company/industry.

  • Job title and industry will be the key components in our analytic approach to ensure we are targeting leads that are most likely to convert.

  • Missing/Low Info indicators will be retained allowing models to learn relationships between enrichment quality and conversion outcomes.

  • During modeling, we will examine Small-tier companies and their likelihood of converting to ensure consistency with the ICP findings.